=Paper=
{{Paper
|id=Vol-126/paper-3
|storemode=property
|title=DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets
|pdfUrl=https://ceur-ws.org/Vol-126/lucchese.pdf
|volume=Vol-126
|dblpUrl=https://dblp.org/rec/conf/fimi/LuccheseOP04
}}
==DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets==
Claudio Lucchese
ISTI “A. Faedo”, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy.
email: claudio.lucchese@isti.cnr.it

Salvatore Orlando
Computer Science Dept., Università Ca’ Foscari, Venezia, Italy.
email: orlando@dsi.unive.it

Raffaele Perego
ISTI “A. Faedo”, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy.
email: raffaele.perego@isti.cnr.it
Abstract

One of the main problems arising in the frequent closed itemsets mining problem is duplicate detection. In this paper we propose a general technique for promptly detecting and discarding duplicate closed itemsets, without the need of keeping in the main memory the whole set of closed patterns.

Our approach can be exploited, with substantial performance benefits, by any algorithm that adopts a vertical representation of the dataset. We implemented our technique within a new depth-first closed itemsets mining algorithm. The experimental evaluation demonstrates that our algorithm outperforms other state of the art algorithms like CLOSET+ and FPCLOSE.

1. Introduction

Frequent itemsets mining is the most important and demanding task in many data mining applications. To describe the mining problem we introduce the following notation. Let I = {a1, ..., aM} be a finite set of items, and D a finite set of transactions (the dataset), where each transaction t ∈ D is a list of distinct items t = {i0, i1, ..., iT}, ij ∈ I. A k-itemset is a sequence of k distinct items I = {i0, i1, ..., ik}, ij ∈ I, sorted on the basis of some total order relation between item literals. The number of transactions in the dataset including an itemset I is defined as the support of I (or supp(I)). Mining all the frequent itemsets from D requires discovering all the itemsets having a support higher than (or equal to) a given threshold min_supp.

The paper is organized as follows. In Sect. 2 we introduce closed itemsets and describe a framework for mining them. This framework is shared by all the algorithms surveyed in Sect. 3. In Sect. 4 we formalize the problem of duplicates and propose our technique. Section 5 presents an implementation of our technique and discusses the experimental results obtained. Some concluding remarks follow.

2. Closed itemsets

The concept of closed itemset is based on the two following functions f and g:

f(T) = {i ∈ I | ∀t ∈ T, i ∈ t}
g(I) = {t ∈ D | ∀i ∈ I, i ∈ t}

where T ⊆ D and I ⊆ I are, respectively, subsets of all the transactions and items occurring in dataset D. Function f returns the set of items included in all the transactions in T, while function g returns the set of transactions supporting a given itemset I. Since the set of transactions g(I) can be represented by a list of transaction identifiers, we refer to g(I) as the tid-list of I. We can introduce the following definition:

Definition 1. An itemset I is said to be closed if and only if

c(I) = f(g(I)) = f ∘ g(I) = I
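To make these definitions concrete, the functions f, g and the closure c = f ∘ g can be sketched in a few lines of Python. This is our own illustrative code (not part of the paper), using the small dataset of Fig. 1.(b):

```python
# Minimal sketch of the Galois operators f, g and the closure c = f ∘ g,
# on the toy dataset of Fig. 1(b); all names here are ours.

DATASET = {  # tid -> transaction (a set of items)
    1: {"B", "D"},
    2: {"A", "B", "C", "D"},
    3: {"A", "B", "C"},
    4: {"C"},
}
ITEMS = set().union(*DATASET.values())

def g(itemset):
    """Tid-list of `itemset`: the transactions containing all of its items."""
    return {tid for tid, t in DATASET.items() if itemset <= t}

def f(tids):
    """The items included in every transaction of `tids`."""
    return {i for i in ITEMS if all(i in DATASET[tid] for tid in tids)}

def c(itemset):
    """Closure operator c = f ∘ g; an itemset X is closed iff c(X) == X."""
    return f(g(itemset))

assert g({"A"}) == {2, 3}                     # supp({A}) = 2
assert c({"A"}) == {"A", "B", "C"}            # {A} is not closed ...
assert c({"A", "B", "C"}) == {"A", "B", "C"}  # ... while {A, B, C} is
```

Here {A} and {A, B, C} share the same tid-list {2, 3}, so they have the same closure: only the maximal one, {A, B, C}, is closed.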
The composite function c = f ∘ g is called the Galois operator or closure operator.

The closure operator defines a set of equivalence classes over the lattice of frequent itemsets: two itemsets belong to the same equivalence class iff they have the same closure, i.e. they are supported by the same set of transactions. We can also show that an itemset I is closed if no superset of I with the same support exists. Thus, a closed itemset is also the maximal itemset of an equivalence class. Mining all these maximal elements of each equivalence class corresponds to mining all the closed itemsets.

Figure 1.(a) shows the lattice of frequent itemsets derived from the simple dataset reported in Fig. 1.(b), mined with min_supp = 1. We can see that the itemsets with the same closure are grouped in the same equivalence class. Each equivalence class contains elements sharing the same supporting transactions, and closed itemsets are their maximal elements. Note that the number of closed itemsets (five) is remarkably lower than the number of frequent itemsets (fifteen).

[Figure 1. (a) Lattice of frequent itemsets, with closed itemsets and equivalence classes, given by the dataset (b). The lattice levels, with supports, are: ABCD (1); ABC (2), ABD (1), ACD (1), BCD (1); AB (2), AC (2), AD (1), BC (2), BD (2), CD (1); A (2), B (3), C (3), D (2); ∅. The dataset (b) is:

TID | items
 1  | B D
 2  | A B C D
 3  | A B C
 4  | C
]

The algorithms for mining frequent closed itemsets adopt a strategy based on two main steps: search space browsing, and closure calculation. In fact, they browse the search space by traversing the lattice of frequent itemsets from an equivalence class to another, and they calculate the closure of the frequent itemsets visited in order to determine the maximal elements (closed itemsets) of the corresponding equivalence classes. Let us analyze these two phases in some depth.

Browsing the search space. The goal of an effective browsing strategy should be to devise a spanning tree over the lattice of frequent itemsets, visiting exactly a single itemset in each equivalence class. We could in fact mine all the closed itemsets by calculating the closure of just one itemset per equivalence class. Let us call the itemsets used to compute closures during the visit closure generators.

Some algorithms choose the minimal elements (or key patterns) of each equivalence class as closure generators. Key patterns form a lattice, and this lattice can be easily traversed with a simple Apriori-like algorithm. Unfortunately, an equivalence class can have more than one minimal element leading to the same closed itemset. For example, the closed itemset {ABCD} of Fig. 1 may be mined twice, since it can be obtained as the closure of the two minimal elements {AD} and {CD} of its equivalence class.

Other algorithms use instead a different technique that we call closure climbing. As soon as a generator is devised, its closure is computed, and new generators are built as supersets of the closed itemset discovered. Since closed itemsets are maximal elements, this strategy always guarantees to jump from an equivalence class to another. Unfortunately, it does not guarantee that the new generators belong to equivalence classes that were not previously visited.

Regardless of the strategy adopted, some kind of duplicate check has thus to be introduced. A naive approach to check for duplicates is to search for each generated closed itemset among all the ones mined so far. Indeed, in order to avoid performing a lot of expensive closure operations, several algorithms exploit the following lemma:

Lemma 1. Given two itemsets X and Y, if X ⊂ Y and supp(X) = supp(Y) (i.e. |g(X)| = |g(Y)|), then c(X) = c(Y).

Proof. If X ⊂ Y, then g(Y) ⊆ g(X). Since |g(Y)| = |g(X)|, then g(Y) = g(X). g(X) = g(Y) ⇒ f(g(X)) = f(g(Y)) ⇒ c(X) = c(Y).
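The equivalence-class structure of Fig. 1 can be checked mechanically: grouping every frequent itemset by its tid-list yields the equivalence classes, and the maximal element of each class is the closed itemset. A short sketch (our own code, not from the paper, on the toy dataset of Fig. 1.(b)):

```python
# Group every frequent itemset of the Fig. 1(b) dataset by its tid-list,
# i.e. by equivalence class; the maximal element of each class is closed.

from itertools import combinations

DATASET = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "B", "C"}, 4: {"C"}}
ITEMS = sorted(set().union(*DATASET.values()))

def tidlist(itemset):
    return frozenset(t for t, tr in DATASET.items() if set(itemset) <= tr)

# All non-empty itemsets that are frequent for min_supp = 1.
frequent = [c for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k) if tidlist(c)]

classes = {}  # tid-list -> itemsets supported by exactly those transactions
for iset in frequent:
    classes.setdefault(tidlist(iset), []).append(set(iset))

closed = [max(members, key=len) for members in classes.values()]

assert len(frequent) == 15 and len(closed) == 5  # as stated for Fig. 1
```

The five closed itemsets found this way are {B}, {C}, {B, D}, {A, B, C} and {A, B, C, D}, matching the maximal elements of the five equivalence classes in Fig. 1.(a).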
Therefore, given a generator X, if we find an already mined closed itemset Y that set-includes X, and the supports of Y and X are identical, we can conclude that c(X) = c(Y). Hence we can prune the generator X without actually calculating its closure.

Also this duplicate checking strategy is however expensive, both in time and space: in time because we may need to search for the inclusion of each generator in a huge number of closed itemsets, in space because to perform it we need to keep all the closed itemsets in the main memory. To reduce such costs, closed sets can be stored in compact prefix tree structures, indexed by one or more levels of hashing.

Calculating closures. To calculate the closure of an itemset X, we have to apply the Galois operator c. Applying c requires intersecting all the transactions of the dataset including X. Another way to calculate the closure is given by the following lemma:

Lemma 2. Given an itemset X and an item i, if g(X) ⊆ g(i), then i ∈ c(X).

Proof. Since g(X ∪ i) = g(X) ∩ g(i), g(X) ⊆ g(i) ⇒ g(X ∪ i) = g(X). Therefore, if g(X ∪ i) = g(X), then f(g(X ∪ i)) = f(g(X)) ⇒ c(X ∪ i) = c(X) ⇒ i ∈ c(X).

From the above lemma, we know that if g(X) ⊆ g(i), then i ∈ c(X). Therefore, by performing this inclusion check for all the items in I not included in X, we can incrementally compute c(X). Note that, since the set g(i) can be represented by the tid-list associated with i, this suggests the adoption of a vertical format for the input dataset in order to efficiently implement the inclusion check g(X) ⊆ g(i).

The closure calculation can be performed off-line or on-line. In the first case we first retrieve the complete set of generators, and then we calculate their closures. In the second case, as soon as a new generator is discovered, its closure is computed on-the-fly.

The algorithms that compute closures on-line are generally more efficient. This is because they can adopt the closure climbing strategy, according to which new generators are created recursively from closed itemsets. These generators are likely longer than key patterns, which are the minimal itemsets of the equivalence class and thus the shortest possible generators. Obviously, the longer the generator is, the fewer checks (on further items to add) are needed to get its closure.

3. Related Works

The first algorithm proposed for mining closed itemsets was A-CLOSE [5] (N. Pasquier, et al.). A-CLOSE first browses level-wise the frequent itemsets lattice by means of an Apriori-like strategy, and mines all the minimal elements of each equivalence class. Since a k-itemset is a key pattern if and only if none of its (k − 1)-subsets has the same support, minimal elements are discovered with intensive subset checking. In its second step, A-CLOSE calculates the closure of all the minimal generators previously found. Since a single equivalence class may have more than one minimal itemset, redundant closures may be computed. A-CLOSE performance suffers from the high cost of the off-line closure calculation and the huge number of subset searches.

The authors of FP-Growth [2] (J. Han, et al.) proposed CLOSET [6] and CLOSET+ [7]. These two algorithms inherit from FP-Growth the compact FP-Tree data structure and the exploration technique based on recursive conditional projections of the FP-Tree. Frequent single items are detected after a first scan of the dataset, and with another scan the pruned transactions are inserted in the FP-Tree stored in the main memory. With a depth-first browsing of the FP-Tree and recursive conditional FP-Tree projections, CLOSET mines closed itemsets by closure climbing, growing frequent closed itemsets with items having the same support in the conditional dataset. Duplicates are discovered with subset checking by exploiting Lemma 1. Thus, all closed sets previously discovered are kept in the main memory, indexed by a two level hash.

CLOSET+ is similar to CLOSET, but exploits an adaptive behaviour in order to fit both sparse and dense datasets. As regards the duplicate detection technique, CLOSET+ introduces a new one for sparse datasets, named upward checking. This technique consists in intersecting every path of the initial FP-Tree leading to a candidate closed itemset X: if such intersection is empty, then X is actually closed. The rationale for using it only on sparse datasets is that the transactions are short, and thus the intersections can be performed quickly. Note that with dense datasets, where the transactions are usually longer, closed itemset equivalence classes are large and the number of duplicates is high; such a technique is not used because of its inefficiency, and CLOSET+ steps back to the same strategy as CLOSET, i.e. storing every mined closed itemset.

FPCLOSE [1], a variant of CLOSET+, turned out to be the best algorithm for closed itemsets mining presented at the ICDM 2003 Frequent Itemset Mining Implementations Workshop.

CHARM [9] (M. Zaki, et al.) performs a bottom-up depth-first browsing of a prefix tree of frequent itemsets built incrementally. As soon as a frequent itemset is generated, its tid-list is compared with those of the other itemsets having the same parent. If one tid-list includes another one, the associated nodes are merged
since both the itemsets surely belong to the same equivalence class. Itemset tid-lists are stored in each node of the tree by using the diff-set technique [8]. Since different paths can however lead to the same closed itemset, also in this case a duplicate pruning strategy is implemented. CHARM adopts a technique similar to that of CLOSET, by storing in the main memory the closed itemsets indexed by a single level hash.

According to our classification, A-CLOSE exploits a key pattern browsing strategy and performs off-line closure calculations, while CHARM, CLOSET+ and FPCLOSE are different implementations of the same closure climbing strategy with incremental closure computation.

4. Removing duplicate generators of closed itemsets

In this Section we discuss a particular visit of the lattice of frequent sets used by our algorithm to identify unique generators of each equivalence class, and to compute all the closed patterns through the minimum number of closure calculations.

In our algorithm, we use closure climbing to browse the search space, find generators and compute their closures. As soon as a generator is found, its closure is computed, and new generators are built as supersets of the closed itemset discovered so far. So, each generator gen browsed by our algorithm can be generally represented as gen = Y ∪ i, where Y is a closed itemset, and i, i ∉ Y, is an item in I.1

1 For each closed itemset Y' ≠ c(∅), it is straightforward to show that there must exist at least one generator having the form gen = Y ∪ i, where Y, Y ⊂ Y', is a closed itemset, i ∉ Y, and Y' = c(gen).

Looking at Figure 1.(a), we can unfortunately discover multiple generators gen = Y ∪ i whose closures produce an identical closed itemset. For example, we have four generators, {A}, {A, B}, {A, C} and {B, C}, whose closure is equal to the closed itemset {A, B, C}. Note that all these generators have the form Y ∪ i, since they can be obtained by adding a single item to a smaller closed itemset, namely ∅, {B} and {C}.

The technique exploited by our algorithm to detect duplicate generators exploits a total lexicographic order relation ≺ between all the itemsets of our search space.2 Since there exists a relation ≺ between each pair of k-itemsets, in order to avoid duplicate closed itemsets, we do not compute the closure of the generators that do not result order preserving according to the definition below.

2 This lexicographic order is induced by an order relation between single item literals, according to which each k-itemset I can be considered as a sorted set of k distinct items {i0, i1, ..., ik}.

Definition 2. A generator X = Y ∪ i, where Y is a closed itemset and i ∉ Y, is said to be order preserving iff i ≺ (c(X) \ X).

The following Theorem shows that, for any closed itemset Y, it is possible to find a sequence of order preserving generators in order to climb a sequence of closed itemsets and arrive at Y. The following Corollary states that this sequence is unique.

Theorem 1. For each closed itemset Y ≠ c(∅), there exists a sequence of n (n ≥ 1) items i0 ≺ i1 ≺ ... ≺ in−1 such that

{gen0, gen1, ..., genn−1} = {Y0 ∪ i0, Y1 ∪ i1, ..., Yn−1 ∪ in−1}

where the various genj are order preserving generators, with Y0 = c(∅), Yj+1 = c(Yj ∪ ij) ∀j ∈ [0, n − 1], and Y = Yn.

Proof. First of all, we show that given a generic generator gen ⊆ Y, c(gen) ⊆ Y. More formally, if ∃Y' such that Y' is a closed itemset, and Y' ⊂ Y, and we extend Y' with an item i ∈ Y \ Y' to obtain gen = Y' ∪ i ⊆ Y, then ∀j ∈ c(gen), j ∈ Y.

Note that g(Y) ⊆ g(gen) because gen ⊆ Y. Moreover, if j ∈ c(gen), then g(c(gen)) ⊆ g(j). Thus, since g(Y) ⊆ g(gen), then g(Y) ⊆ g(j) also holds, so that j ∈ c(Y) too. So, if j ∉ Y held, Y would not be closed, and this is in contradiction with the hypothesis.

As regards the proof of the Theorem, we show it by constructing a sequence of closed itemsets and associated generators having the properties stated above.

We have that Y0 = c(∅). All the items in Y0 appear in every transaction of the dataset and therefore, by definition of closure, they must also be included in Y, i.e. Y0 ⊆ Y.

Since Y0 ≠ Y by definition, we choose i0 = min≺(Y \ Y0), i.e. i0 is the smallest item in {Y \ Y0} with respect to the lexicographic order ≺, in order to create the first order preserving generator {Y0 ∪ i0}. Afterwards we calculate Y1 = c(Y0 ∪ i0) = c(gen0).

Once Y1 is found, if Y1 = Y we stop. Otherwise we choose i1 = min≺(Y \ Y1), where i0 ≺ i1 by construction, in order to build the next order preserving generator gen1 = Y1 ∪ i1, and we calculate Y2 = c(Y1 ∪ i1) = c(gen1).

Once Y2 is found, if Y2 = Y we stop, otherwise we iterate, by choosing i2 = min≺(Y \ Y2), and so on.

Note that each generator genj = {Yj ∪ ij} is order preserving, because c({Yj ∪ ij}) = Yj+1 ⊆ Y and ij is
the minimum item in {Y \ Yj} by construction, i.e. ij ≺ {Yj+1 \ {Yj ∪ ij}}.

Corollary 1. For each closed itemset Y ≠ c(∅), the sequence of order preserving generators {gen0, gen1, ..., genn−1} = {Y0 ∪ i0, Y1 ∪ i1, ..., Yn−1 ∪ in−1} as stated in Theorem 1 is unique.

Proof. Since all the items in Y0 appear in every transaction of the dataset, by definition of closure they must also be included in Y; the sequence therefore necessarily starts with Y0 = c(∅).

During the construction of the sequence of generators, suppose that we choose ij ≠ min≺(Y \ Yj) to construct generator genj. Since genj and all the following generators must be order preserving, it would then be impossible to obtain Y, since we could no longer consider the item i = min≺(Y \ Yj) ∈ Y in any other generator or closure while respecting the order preserving property.

Looking at Figure 1.(a), for each closed itemset we can easily identify these unique sequences of order preserving generators. For example, for the closed itemset Y = {A, B, C, D} we have Y0 = c(∅) = ∅, gen0 = ∅ ∪ A, Y1 = c(gen0) = {A, B, C}, gen1 = {A, B, C} ∪ D and, finally, Y = c(gen1). Another example regards the closed itemset Y = {B, D}, where we have Y0 = c(∅) = ∅, gen0 = ∅ ∪ B, Y1 = c(gen0) = {B}, gen1 = {B} ∪ D and, finally, Y = c(gen1).

In order to exploit the results of Theorem 1, we need a fast way to check whether a generator is order preserving.

Lemma 3. Let gen = Y ∪ i be a generator of a closed itemset, where Y is a closed itemset and i ∉ Y, and let pre-set(gen) = {j ≺ i | j ∉ gen}. gen is not order preserving iff ∃j ∈ pre-set(gen) such that g(gen) ⊆ g(j).

Proof. If g(gen) ⊆ g(j), then j ∈ c(gen). Since, by hypothesis, j ≺ i, it is not true that i ≺ (c(gen) \ gen), because j ∈ (c(gen) \ gen).

The previous Lemma introduces the concept of pre-set(gen), where gen = {Y ∪ i} is a generator, and gives a way to check the order preserving property of gen by scanning all the g(j), for all j ∈ pre-set(gen).

We have thus contributed a deep study of the problem of duplicates in mining frequent closed itemsets. By reformulating the duplicates problem as the problem of visiting the lattice of frequent itemsets according to a total (lexicographic) order, we have moved the dependencies of the order preserving check from the set of closed itemsets already mined to the tid-lists associated with single items. This new technique is not resource demanding, because frequent closed itemsets need not be stored in the main memory during the computation, and it is not time demanding, because the order preserving check is cheaper than searching the set of closed itemsets mined so far. Note that CLOSET+ needs the initial FP-Tree as an additional requirement beyond the current FP-Tree in use and, moreover, does not use its upward checking technique with dense datasets.

5. The DCI_Closed algorithm

The pseudo-code of the recursive procedure DCI_Closed() is shown in Algorithm 1. The procedure receives three parameters: a closed itemset CLOSED_SET, and two sets of items, i.e. the PRE_SET and the POST_SET.

The procedure outputs all the non-duplicate closed itemsets that properly contain CLOSED_SET. In particular, the goal of the procedure is to deeply explore each valid new generator obtained from CLOSED_SET by extending it with the elements in POST_SET.

Before calling procedure DCI_Closed(), the dataset D is scanned to determine the frequent single items F1 ⊆ I, and to build the bitwise vertical dataset VD containing the various tid-lists g(i), ∀i ∈ F1. The procedure is then called by passing as arguments CLOSED_SET = c(∅), PRE_SET = ∅, and POST_SET = F1 \ c(∅). Note that the itemset c(∅) contains the items, if any, that occur in all the transactions of the dataset D.

The procedure builds all the possible generators by extending CLOSED_SET with the various items in POST_SET (lines 2–3). The infrequent and duplicate generators (i.e., the non order preserving ones) are discarded as invalid (lines 4–5). Note that the items i ∈ POST_SET used to obtain those invalid generators will no longer be considered in the following recursive calls. Only the valid generators are then extended to compute their closure (lines 6–15). It is worth noting that each generator new_gen ← CLOSED_SET ∪ i is strictly extended according to the order preserving property, i.e. by using all items j ∈ POST_SET such that i ≺ j (line 8). All the items j, i ≺ j, which do not belong to c(new_gen), are included in the new POST_SET (line 12) and are used for the next recursive call. At the end of this process, a new closed set (CLOSED_SET_New ← c(new_gen)) is obtained (line 15). From this new closed set, new generators and corresponding closed sets can be built by recursively calling the procedure DCI_Closed() (line 16).
Algorithm 1 DCI_Closed pseudocode

 1: procedure DCI_Closed(CLOSED_SET, PRE_SET, POST_SET)
 2:   for all i ∈ POST_SET do                        ▷ Try to create a new generator
 3:     new_gen ← CLOSED_SET ∪ i
 4:     if supp(new_gen) ≥ min_supp then             ▷ new_gen is frequent
 5:       if is_dup(new_gen, PRE_SET) = FALSE then   ▷ Duplication check
 6:         CLOSED_SET_New ← new_gen
 7:         POST_SET_New ← ∅
 8:         for all j ∈ POST_SET, i ≺ j do           ▷ Compute closure of new_gen
 9:           if g(new_gen) ⊆ g(j) then
10:             CLOSED_SET_New ← CLOSED_SET_New ∪ j
11:           else
12:             POST_SET_New ← POST_SET_New ∪ j
13:           end if
14:         end for
15:         Write out CLOSED_SET_New and its support
16:         DCI_Closed(CLOSED_SET_New, PRE_SET, POST_SET_New)
17:         PRE_SET ← PRE_SET ∪ i
18:       end if
19:     end if
20:   end for
21: end procedure

24: function is_dup(new_gen, PRE_SET)               ▷ Duplicate check
25:   for all j ∈ PRE_SET do
26:     if g(new_gen) ⊆ g(j) then
27:       return TRUE                                ▷ new_gen is not order preserving!
28:     end if
29:   end for
30:   return FALSE
31: end function
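Algorithm 1 can be rendered as runnable code. The sketch below is our own simplified rendering, not the authors' optimized implementation: tid-lists are Python integers used as bit vectors, in the spirit of the vertical bitmap layout of Sect. 5, so the inclusion test g(X) ⊆ g(j) becomes a bitwise AND:

```python
# Our own runnable sketch of Algorithm 1 on the toy dataset of Fig. 1(b).
# Tid-lists are ints used as bit vectors: g(X) ⊆ g(j) <=> gX & gj == gX.

TRANSACTIONS = [{"B", "D"}, {"A", "B", "C", "D"}, {"A", "B", "C"}, {"C"}]
MIN_SUPP = 1

ITEMS = sorted(set().union(*TRANSACTIONS))  # frequent single items, ordered by ≺
TIDLIST = {i: sum(1 << t for t, tr in enumerate(TRANSACTIONS) if i in tr)
           for i in ITEMS}
ALL_TIDS = (1 << len(TRANSACTIONS)) - 1

def g(itemset):
    """Bitwise tid-list of an itemset: AND of its items' tid-lists."""
    gx = ALL_TIDS
    for i in itemset:
        gx &= TIDLIST[i]
    return gx

def supp(tids):
    return bin(tids).count("1")

def is_dup(g_new_gen, pre_set):
    """new_gen is not order preserving iff g(new_gen) ⊆ g(j), j ∈ PRE_SET."""
    return any(g_new_gen & TIDLIST[j] == g_new_gen for j in pre_set)

def dci_closed(closed_set, pre_set, post_set, out):
    pre_set = list(pre_set)                      # grown locally with valid items
    for k, i in enumerate(post_set):             # try to create a new generator
        g_new = g(closed_set | {i})
        if supp(g_new) >= MIN_SUPP and not is_dup(g_new, pre_set):
            closed_new, post_new = closed_set | {i}, []
            for j in post_set[k + 1:]:           # compute the closure (Lemma 2)
                if g_new & TIDLIST[j] == g_new:  # g(new_gen) ⊆ g(j) => j ∈ c(new_gen)
                    closed_new.add(j)
                else:
                    post_new.append(j)
            out.append((frozenset(closed_new), supp(g_new)))
            dci_closed(closed_new, pre_set, post_new, out)
            pre_set.append(i)                    # PRE_SET <- PRE_SET ∪ i
    return out

C_EMPTY = {i for i in ITEMS if TIDLIST[i] == ALL_TIDS}   # c(∅)
closed_itemsets = dci_closed(C_EMPTY, [],
                             [i for i in ITEMS if i not in C_EMPTY], [])
# yields the five closed itemsets of Fig. 1(a): B, C, BD, ABC, ABCD
```

On this dataset the sketch prunes exactly the non order preserving generators, such as {B, C} (whose tid-list is included in g(A)) and {D} (whose tid-list is included in g(B)), and writes out each closed itemset once.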
Finally, it is worth pointing out that, in order to force the lexicographic order of the visit, the two for all loops (line 2 and line 8) have to extract items from POST_SET while respecting this order.

Before recursively calling the procedure, it is necessary to prepare the suitable PRE_SET and POST_SET to be passed to the new recursive call. Upon each recursive call, the size of the new POST_SET monotonically decreases, while the size of the new PRE_SET instead increases.

As regards the composition of the new POST_SET, assume that the closed set X = CLOSED_SET_New passed to the procedure (line 16) has been obtained by computing the closure of a generator new_gen = Y ∪ i, where Y = CLOSED_SET and i ∈ POST_SET. The POST_SET_New to be passed to the recursive call is built as the set of all the items that follow i in the lexicographic order and that have not already been included in X. More formally, POST_SET_New = {j ∈ F1 | i ≺ j and j ∉ X}. This condition allows the recursive call to build only new generators X ∪ j, where i ≺ j (according to the hypotheses of Theorem 1).

The composition of the new PRE_SET depends instead on the valid generators3 that precede new_gen = Y ∪ i in the lexicographic order. If all the generators were valid, it would simply be composed of all the items j that precede i in the lexicographic order, with j ∉ X = c(new_gen). In other words, the new PRE_SET would be the complement set of X ∪ POST_SET_New.

3 The ones that have passed the frequency and duplicate tests.

While the composition of POST_SET guarantees that the various generators will be produced according to the lexicographic order ≺, the composition of PRE_SET guarantees that duplicate generators will be pruned by function is_dup().

Since we have shown that for each closed itemset Y there exists one and only one sequence of order preserving generators, and since our algorithm clearly explores every possible order preserving generator from every closed itemset, the algorithm is complete and does not produce any duplicate.
5.0.1. Some optimizations exploited in the algorithm. We adopted a large number of optimizations to reduce the cost of the bitwise intersections needed for the duplication and closure computations (lines 9–10 and 26 of Algorithm 1). For the sake of simplicity, these optimizations are not reported in the pseudo-code shown in Algorithm 1.

DCI-CLOSED inherits the internal representation of our previous works DCI [4] and kDCI [3]. The dataset is stored in the main memory using a vertical bitmap representation. With two successive scans of the dataset, a bitmap matrix D of size M × N is stored in the main memory. The D(i, j) bit is set to 1 if and only if the j-th transaction contains the i-th frequent single item. Row i of the matrix thus represents the tid-list of item i.

The columns of D are then reordered to profit from data correlation. This is possible and highly worthwhile when we mine dense datasets. As in [3][4], columns are reordered to create a submatrix E of D having all its rows identical. Every operation (e.g. intersections) involving rows in the submatrix E will be performed only once, thus gaining strong performance improvements.

This kind of representation fits with our framework, because the three main operations, i.e. support count, closure, and duplicate check, can be quickly performed with cheap bitwise AND/OR operations.

Besides the DCI optimizations, specifically tailored for sparse and dense datasets, we exploited more specific techniques made possible by the depth-first visit of the lattice of itemsets.

In order to determine that the itemset X is closed, the tid-list g(X) must have been compared with all the g(j)'s, for all items j contained in the pre-list (post-list) of X, i.e. the items that precede (follow) all items included in X according to the lexicographic order. The PRE_SET must have been accessed for checking duplicate generators, and the POST_SET for computing the closure. In particular, for all j ∈ PRE_SET ∪ POST_SET, we already know that g(X) ⊄ g(j), otherwise those items j would have been included in X.

Therefore, since g(X) must already have been compared with all the g(j), for all items j contained in the PRE_SET (POST_SET) of X, we may save some important information regarding each comparison between g(j) and g(X). Such information will be used to reduce the cost of the subsequent uses of g(j), when these tid-lists will have to be exploited to look for further closed itemsets that include/extend X. In particular, even if, for all j, it is true that g(X) ⊄ g(j), we may know that some large portions of the bitwise tid-list g(X) are however strictly included in g(j). Let g(X)_j be the portion of the bitwise tid-list g(X) strictly included in the corresponding portion of g(j). Hence, since g(X)_j ⊆ g(j), it is straightforward to show that g(X ∪ Y)_j ⊆ g(j) continues to hold for every itemset Y used to extend X, because g(X ∪ Y) ⊆ g(X) holds. So, when we extend X to obtain a new generator, we can limit the inclusion check against the various g(j) to the complementary portions of the tid-lists, thus strongly reducing its cost.

5.0.2. Dealing with sparse datasets. It is possible to show that in sparse datasets the number of closed itemsets is nearly equal to the number of frequent ones, so near that they are often the same. This means that the techniques for mining closed itemsets are of little use, because almost every duplicate check or closure calculation is likely to fail.

For this reason, in the case of sparse datasets, we preferred to exploit our frequent itemset mining algorithm [3] with an additional closedness test over the frequent itemsets discovered. Given a new frequent itemset X, every subset of X of length |X| − 1 with the same support as X is marked as non closed. Experiments showed that this approach is fruitful (see Fig. 2.b).

5.0.3. Space complexity. For all the algorithms requiring to keep in the main memory the whole set of closed itemsets to perform the duplicate check, the size of the output is actually a lower bound on their space complexity. Conversely, we will show that the amount of memory required by an implementation based on our duplicate discarding technique is independent of the size of the output. To some extent, its memory occupation depends on those data structures that also need to be maintained in memory by other algorithms that visit the lattice depth-first and exploit tid-list intersections over a vertical dataset.

The main information to be kept in the main memory is the tid-list of each node in the current path along the lattice, and the tid-list of every frequent single item. In this way we are able to browse the search space by intersecting nodes with single item tid-lists, and to discard duplicates by checking the order preserving property.

The worst case of memory occupation happens when the number of generators and frequent single items is maximal: this occurs when c(∅) = ∅ and every itemset is frequent and closed. If N is the number of frequent single items, the deepest path has N nodes, and since one of these nodes is a single item, the total number of tid-lists to be kept in the main memory is 2N − 1. Since the length of a tid-list is equal to the number of
[Figure 2: six performance plots comparing FPCLOSE, CLOSET+, and DCI-CLOSED as a function of the support threshold (%): (a) memory occupation in MegaBytes on connect; execution times in seconds on (b) retail, (c) chess, (d) pumsb, (e) connect, and (f) pumsb*.]

Figure 2. (a) Memory occupation on the connect dataset as a function of the minimum support threshold. (b-f) Execution times of FPCLOSE, CLOSET+, and DCI-CLOSED as a function of the minimum support threshold on various publicly available datasets.
transactions T in the dataset, the space complexity of our algorithm is

O((2N − 1) × T).

Figure 2.(a) plots the memory occupation of FPCLOSE, CLOSET+ and our algorithm DCI-CLOSED when mining the connect dataset as a function of the support threshold. The experimental results agree with our estimates: whereas FPCLOSE and CLOSET+ memory occupation grows exponentially because of the huge number of closed itemsets generated, our implementation needs much less memory (up to two orders of magnitude) because its occupation depends linearly on N.

5.1. Experimental results

We tested our implementation on a suite of publicly available dense datasets (chess, connect, pumsb, pumsb*), and compared the performances with those of two well known state of the art algorithms: FPCLOSE [1] and CLOSET+ [7]. FPCLOSE is publicly available at http://fimi.cs.helsinki.fi/src/fimi06.html, while the Windows binary executable of CLOSET+ was kindly provided to us by the authors. Since FPCLOSE was already proved to outperform CHARM on every dataset, we did not use CHARM in our tests.

The experiments were conducted on a Windows XP PC equipped with a 2.8GHz Pentium IV and 512MB of RAM. The algorithms FPCLOSE and DCI-CLOSED were compiled with the gcc compiler available in the cygwin environment.

As shown in Fig. 2.(b-f), DCI-CLOSED outperforms both algorithms in all the tests conducted. CLOSET+ performs quite well on the connect dataset with low supports, but in every other case it is about two orders of magnitude slower. FPCLOSE is effective on pumsb*, where it is near to DCI-CLOSED, but it is about one order of magnitude slower in all the other tests.

6. Conclusions

In this paper we provide a deep study of the problem of mining frequent closed itemsets, formalizing a general framework fitting every mining algorithm. Using such framework we were able to analyse the problem of duplicates arising in this mining problem.

We have proposed a technique for promptly detecting and discarding duplicates, without the need of keeping in the main memory the whole set of closed patterns, and we implemented this technique into a new algorithm which uses a vertical bitmap representation of the dataset.

The experimental evaluation demonstrated that our approach is very effective. The proposed implementation outperforms FPCLOSE and CLOSET+ in all the tests conducted, and requires orders of magnitude less memory.

References

[1] Gosta Grahne and Jianfei Zhu. Efficiently using prefix-trees in mining frequent itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, November 2003.

[2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Proc. SIGMOD '00, pages 1-12, 2000.

[3] Claudio Lucchese, Salvatore Orlando, Paolo Palmerini, Raffaele Perego, and Fabrizio Silvestri. kDCI: a multi-strategy algorithm for mining frequent sets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, November 2003.

[4] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. In Proc. of the 2002 IEEE International Conference on Data Mining (ICDM '02), pages 338-345, 2002.

[5] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. In Proc. ICDT '99, 1999.

[6] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In SIGMOD International Workshop on Data Mining and Knowledge Discovery, May 2000.

[7] Jian Pei, Jiawei Han, and Jianyong Wang. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In SIGKDD '03, August 2003.

[8] Mohammed J. Zaki and Karam Gouda. Fast vertical mining using diffsets. Technical Report 01-1, Computer Science Dept., Rensselaer Polytechnic Institute, March 2001.

[9] Mohammed J. Zaki and Ching-Jui Hsiao. CHARM: An efficient algorithm for closed itemsets mining. In 2nd SIAM International Conference on Data Mining, April 2002.