=Paper=
{{Paper
|id=Vol-90/paper-11
|storemode=property
|title=kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets
|pdfUrl=https://ceur-ws.org/Vol-90/palmerini.pdf
|volume=Vol-90
|dblpUrl=https://dblp.org/rec/conf/fimi/OrlandoLPPS03
}}
==kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets==
Claudio Lucchese (1), Salvatore Orlando (1), Paolo Palmerini (1,2), Raffaele Perego (2), Fabrizio Silvestri (2,3)

(1) Dipartimento di Informatica, Università Ca’ Foscari di Venezia, Venezia, Italy ({orlando,clucches}@dsi.unive.it)
(2) ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy ({r.perego,p.palmerini}@isti.cnr.it)
(3) Dipartimento di Informatica, Università di Pisa, Pisa, Italy (silvestri@di.unipi.it)
Abstract

This paper presents the implementation of kDCI, an enhancement of DCI [10], a scalable algorithm for discovering frequent sets in large databases.
The main contribution of kDCI is a novel counting inference strategy, inspired by previously known results by Bastide et al. [3]. Moreover, multiple heuristics and efficient data structures are used in order to adapt the algorithm behavior to the features of the specific dataset mined and of the computing platform used.
kDCI turns out to be effective in mining both short and long patterns from a variety of datasets. We conducted a wide range of experiments on synthetic and real-world datasets, both in-core and out-of-core. The results obtained allow us to state that kDCI's performance is not over-fitted to a special case, and that its high performance is maintained on datasets with different characteristics.

1 Introduction

Despite the considerable number of algorithms proposed in the last decade for solving the problem of finding frequent patterns in transactional databases (among the many we mention [1] [11] [6] [13] [14] [4] [3] [7]), a single best approach still has to be found.
The Frequent Set Counting (FSC) problem consists in finding all the sets of items (itemsets) which occur in at least s% (s is called the support) of the transactions of a database D, where each transaction is a variable-length collection of items from a set I. Itemsets which satisfy the minimum support threshold are said to be frequent.
The complexity of the FSC problem lies mainly in the potentially explosive growth of its full search space, whose dimension d is, in the worst case, d = Σ_{k=1}^{|tmax|} C(|I|, k), where |tmax| is the maximum transaction length. Taking into account the minimum support threshold, it is possible to reduce the search space using the well-known downward closure relation, which states that an itemset can only be frequent if all its subsets are frequent as well. The exploitation of this property, originally introduced in the Apriori algorithm [1], has transformed a potentially exponential problem into a more tractable one.
Nevertheless, the Apriori property alone is not sufficient to solve the FSC problem in reasonable time in all cases, i.e. on all possible datasets and for all possible interesting values of s. Indeed, another source of complexity in the FSC problem resides in the internal correlation and statistical properties of the dataset, which remain unknown until the mining is completed. Such diversity in dataset properties is reflected in measurable quantities, like the total number of transactions or the total number of distinct items |I| appearing in the database, but also in other, fuzzier properties which, although commonly recognized as important, still lack a formal and univocal definition. It is the case, for example, of the notion of how dense a dataset is, i.e. how much its transactions tend to resemble one another.
Several important results have been achieved for specific cases. Dense datasets are effectively mined with compressed data structures [14], the explosion in the number of candidates can be avoided using effective projections of the dataset [7], and the support of itemsets in compact datasets can be inferred, without counting, using an equivalence-class-based partition of the dataset [3].
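As a concrete illustration, the downward closure check that underlies Apriori-like candidate generation can be sketched in C++ as follows (container choices and helper names are illustrative assumptions, not taken from any of the cited implementations):

    #include <cstddef>
    #include <set>
    #include <vector>

    using Itemset = std::vector<int>;   // items kept sorted

    // A candidate of length k can be frequent only if all of its
    // (k-1)-subsets are already known to be frequent (downward closure).
    bool all_subsets_frequent(const Itemset& candidate,
                              const std::set<Itemset>& frequent_k_minus_1) {
        for (std::size_t i = 0; i < candidate.size(); ++i) {
            Itemset subset;                       // candidate without item i
            subset.reserve(candidate.size() - 1);
            for (std::size_t j = 0; j < candidate.size(); ++j)
                if (j != i) subset.push_back(candidate[j]);
            if (frequent_k_minus_1.count(subset) == 0)
                return false;                     // one infrequent subset is enough to prune
        }
        return true;
    }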
In order to take advantage of all these and more specific results, hybrid approaches have been proposed [5]. Critical to this point is when and how to adopt a given solution instead of another. In the absence of a complete theoretical understanding of the FSC problem, the only solution is to adopt a heuristic approach, where theoretical reasoning is supported by direct experience, leading to a strategy that tries to cover as wide a variety of cases as possible.
Starting from the previous DCI (Direct Count & Intersect) algorithm [10], we propose here kDCI, an enhanced version of DCI that extends its adaptability to the dataset-specific features and the hardware characteristics of the computing platform used for running the FSC algorithm. Moreover, in kDCI we introduce a novel counting inference strategy, based on a new result inspired by the work of Bastide et al. in [3].
kDCI is a multiple-heuristics hybrid algorithm, able to adapt its behavior during the execution. Since it originates from the already published DCI algorithm, we only outline in this paper how kDCI differs from DCI. A detailed description of the DCI algorithm can be found in [10].

2 The kDCI algorithm

Several considerations concerning the features of real datasets, the characteristics of modern hardware/software systems, as well as scalability issues of FSC algorithms, have motivated the design of kDCI. As already pointed out, transactional databases may have different characteristics in terms of correlations among the items inside transactions and of transactions among themselves [9]. A desirable feature of an FSC algorithm should be the ability to adapt its behavior to these characteristics.
Modern hardware/software systems need high locality for exploiting memory hierarchies effectively and achieving high performance. Algorithms have to favor the exploitation of spatial and temporal locality in accessing in-core and out-of-core data.
Scalability is the main concern in designing algorithms that aim to mine large databases efficiently. Therefore, it is important to be able to handle datasets bigger than the available memory.
We designed and implemented our algorithm kDCI keeping in mind such performance issues. The pseudo code of kDCI is given in Algorithm 1.

Algorithm 1 kDCI
    Require: D, min_supp
    // during the first scan, gather optimization figures
    F1 = first_scan(D, min_supp)
    // second and following scans work on a temporary db D'
    F2 = second_scan(D', min_supp)
    k = 2
    while (D'.vertical_size() > memory_available()) do
        k++
        // count-based iteration
        Fk = DCP(D', min_supp, k)
    end while
    k++
    // count-based iteration + creation of the vertical database VD
    Fk = DCP(D', VD, min_supp, k)
    dense = VD.is_dense()
    while (Fk ≠ ∅) do
        k++
        if (use_key_patterns()) then
            if (dense) then
                Fk = DCI_dense_keyp(VD, min_supp, k)
            else
                Fk = DCI_sparse_keyp(VD, min_supp, k)
            end if
        else
            if (dense) then
                Fk = DCI_dense(VD, min_supp, k)
            else
                Fk = DCI_sparse(VD, min_supp, k)
            end if
        end if
    end while

kDCI inherits from DCI the level-wise behavior and the hybrid horizontal-vertical dataset representation. As the computation starts, kDCI maintains the database in horizontal format and applies an effective pruning technique to remove infrequent items and short transactions. A temporary dataset is therefore written to disk at every iteration. The first steps of the algorithm are described in [8] and [10] and remain unchanged in kDCI. In kDCI we only improved memory management by exploiting compressed and optimized data structures (see Sections 2.1 and 2.2).
The effectiveness of pruning is related to the possibility of storing the dataset in main memory in vertical format, thanks to the dataset size reduction. This normally occurs at the first iterations, depending on the dataset, the support threshold and the memory available on the machine, which is determined at run time.
Once the dataset can be stored in main memory, kDCI switches to the vertical representation, and applies several heuristics in order to determine the most effective strategy for frequent itemset counting.
The most important innovation introduced in kDCI regards a novel technique to determine the itemset supports, inspired by the work of Bastide et al. [3]. As we will discuss in Section 2.4, in some cases the support of candidate itemsets can be determined without actually counting transactions, but by a faster inference reasoning.
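The switch to the vertical in-core format described above can be pictured with the following minimal C++ sketch (types and names are illustrative assumptions, not the actual kDCI data structures):

    #include <cstddef>
    #include <vector>

    // Vertical representation: one bitvector (tidlist) per item,
    // bit t set iff transaction t contains the item.
    std::vector<std::vector<bool>>
    build_vertical(const std::vector<std::vector<int>>& transactions, // horizontal, remapped ids
                   std::size_t num_items) {
        std::vector<std::vector<bool>> tidlist(
            num_items, std::vector<bool>(transactions.size(), false));
        for (std::size_t t = 0; t < transactions.size(); ++t)
            for (int item : transactions[t])
                tidlist[item][t] = true;
        return tidlist;
    }

In kDCI the vertical dataset is a collection of such bitvector tidlists, on which the intersection-based phase operates.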
Moreover, kDCI maintains the different strategies implemented in DCI for sparse and dense datasets. The result is a multiple-strategy approach: during the execution kDCI collects statistical information on the dataset that allows it to determine which is the best approach for the particular case.
In the following we detail such optimizations and improvements, and the heuristics used to decide which optimization to use.

2.1 Dynamic data type selection

The first optimization is concerned with the amount of memory used to represent itemsets and their counters. Since such structures are extensively accessed during the execution of the algorithm, it is profitable to have such data occupy as little memory as possible. This not only allows us to reduce the spatial complexity of the algorithm, but also permits low-level processor optimizations to be effective at run time.
During the first scan of the dataset, global properties are collected, such as the total number of distinct frequent items (m1), the maximum transaction size, and the support of the most frequent item.
Once this information is available, we remap the surviving (frequent) items to contiguous integer identifiers. This allows us to decide the best data type to represent such identifiers and their counters. For example, if the maximum support of any item is less than 65536, we can use an unsigned short int to represent the itemset counters. The same holds for the remapped identifiers of the items. The decision of which is the most appropriate type to use for items and counters is taken at run time, by means of a C++ template-based implementation of all the kDCI code.
Before remapping item identifiers, we also reorder them in increasing support order: more frequent items are thus assigned larger identifiers. This also simplifies the intersection-based technique used for dense datasets (see Section 2.3).
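The run-time choice of the narrowest data types, obtained in kDCI through a C++ template-based implementation, can be sketched as follows (the dispatch below is an illustrative assumption, not the actual kDCI code):

    #include <cstdint>
    #include <limits>

    // Core mining routine, instantiated for the chosen item and counter types.
    template <typename ItemT, typename CountT>
    void run_mining(/* dataset, min_supp, ... */) {
        // ... level-wise mining using ItemT for item ids and CountT for counters ...
    }

    // Pick the narrowest unsigned types that can hold the remapped item ids
    // and the maximum observed support, then instantiate the miner once.
    void dispatch(std::uint32_t num_frequent_items, std::uint32_t max_item_support) {
        const bool small_items  = num_frequent_items <= std::numeric_limits<std::uint16_t>::max();
        const bool small_counts = max_item_support   < 65536;   // fits an unsigned short
        if (small_items && small_counts)  run_mining<std::uint16_t, std::uint16_t>();
        else if (small_items)             run_mining<std::uint16_t, std::uint32_t>();
        else if (small_counts)            run_mining<std::uint32_t, std::uint16_t>();
        else                              run_mining<std::uint32_t, std::uint32_t>();
    }

Using 16-bit items and counters whenever possible shrinks the in-memory itemset collections and, as noted above, helps the low-level processor optimizations take effect.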
2.2 Compressed data structures

Itemsets are often organized in collections in many FSC algorithms. Efficient representation of such collections can lead to important performance improvements. In [8] we pointed out the advantages of storing candidates in directly accessible data structures for the first passes of our algorithm. In kDCI we introduce a compressed representation of an itemset collection, used to store in main memory the collections of candidate and frequent itemsets. This representation takes advantage of prefix sharing among the lexicographically ordered itemsets of the collection.

[Figure 1. Compressed data structure used for the itemset collection can reduce the amount of memory needed to store the itemsets. In the example, ten 4-itemsets sharing three distinct 3-prefixes require 9 + 3 + 10 = 21 cells in compressed form versus 4 x 10 = 40 cells uncompressed.]

The compressed data structure is based on three arrays (Figure 1). At each iteration k, the first array (prefix) stores the different prefixes of length k−1. In the third array (suffix) all the length-1 suffixes are stored. Finally, in element i of the second array (index), we store the position in the suffix array of the section of suffixes that share the same prefix. Therefore, when the itemsets in the collection have to be enumerated, we first access the prefix array. Then, from the corresponding entry in the index array, we get the section of suffixes stored in suffix, needed to complete the itemsets.
From our tests we can say that, in all the interesting cases – i.e., when the number of candidate (or frequent) itemsets explodes – this data structure works well and achieves a compression ratio of up to 30%. For example, see the results reported in Figure 2.

[Figure 2. Memory usage (KBytes, compressed vs. non-compressed itemset collection) as a function of the iteration k, for BMS_View with min_supp=0.06% (a) and connect with min_supp=80% (b).]
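A minimal C++ sketch of the three-array layout of Figure 1 (type and member names are illustrative assumptions, not the actual kDCI types):

    #include <cstddef>
    #include <vector>

    // Compressed collection of k-itemsets sharing (k-1)-prefixes.
    struct CompressedItemsets {
        std::vector<std::vector<int>> prefix; // distinct (k-1)-prefixes, lexicographic order
        std::vector<std::size_t>      index;  // index[i] = first position in 'suffix' for prefix[i]
        std::vector<int>              suffix; // last items, grouped by prefix

        // Enumerate all itemsets by gluing each prefix to its section of suffixes.
        template <typename Visit>
        void for_each(Visit visit) const {
            for (std::size_t i = 0; i < prefix.size(); ++i) {
                std::size_t end = (i + 1 < prefix.size()) ? index[i + 1] : suffix.size();
                for (std::size_t s = index[i]; s < end; ++s) {
                    std::vector<int> itemset = prefix[i];
                    itemset.push_back(suffix[s]);
                    visit(itemset);
                }
            }
        }
    };

For the collection of Figure 1, the three prefixes, three index entries and ten suffixes account for the 21 cells reported in the figure, against 40 cells for the flat representation.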
2.3 Heuristics

One of the most important features of kDCI is its ability to adapt its behavior to the dataset-specific characteristics. It is well known that being able to distinguish between sparse and dense datasets, for example, allows us to adopt specific and effective optimizations. Moreover, as we will explain in Section 2.4, if the number of frequent itemsets is much greater than the number of closed itemsets, it is possible to apply a counting inference procedure that dramatically reduces the time needed to determine itemset supports.
In kDCI we devised two main heuristics that allow us to distinguish between dense and sparse datasets and to decide whether to apply the counting inference procedure or not.
The first heuristic is simply based on a measure of the dataset density. Namely, we measure the correlation among the tidlists corresponding to the most frequent items. We require that the maximum number of frequent items for which such correlation is significant, weighted by the correlation degree itself, is above a given threshold.

[Figure 3. Heuristic to establish dataset density or sparsity: example (a) is classified dense (d = 0.23 > δ) and example (b) sparse (d = 0.15 < δ), with δ = 0.2.]

As an example, consider the two datasets in Figure 3, where tidlists are placed horizontally, i.e. rows correspond to items and columns to transactions. Suppose we choose a density threshold δ = 0.2. If we order the items according to their support, we have the densest region of the dataset at the bottom of each figure. Starting from the bottom, we find the maximum number of items whose tidlists have a significant intersection. In the case of dataset (a), for example, a fraction f = 1/4 of the items shares p = 90% of the transactions, leading to a density of d = f·p = 0.25 × 0.9 = 0.23, which is above the density threshold. For dataset (b), on the other hand, a smaller intersection of p = 50% is common to f = 1/3 of the items. In this last case the density d = f·p = 0.3 × 0.5 = 0.15 is lower than the threshold and the dataset is considered sparse. It is worth noticing that, since this notion of density depends on the minimum support threshold, the same dataset can exhibit different behaviors when mined with different support thresholds.
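The density check of this example can be condensed into the following C++ sketch (names, and the way the shared-transaction fraction is measured, are illustrative assumptions):

    #include <cstddef>

    // Decide whether a vertical dataset looks dense.
    // Items are assumed ordered from most to least frequent;
    // intersection_fraction(k) returns the fraction p of transactions
    // shared by the k most frequent items (the measured correlation).
    bool is_dense(std::size_t num_items,
                  double density_threshold,                       // e.g. δ = 0.2
                  double (*intersection_fraction)(std::size_t)) {
        double best = 0.0;
        for (std::size_t k = 2; k <= num_items; ++k) {
            double f = static_cast<double>(k) / num_items;        // fraction of items considered
            double p = intersection_fraction(k);                  // fraction of shared transactions
            if (f * p > best) best = f * p;                       // weighted correlation d = f * p
        }
        return best > density_threshold;
    }

With the numbers of Figure 3(a), the sketch returns true, since 0.25 × 0.9 = 0.23 exceeds δ = 0.2.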
Once the dataset density is determined, we adopt the same optimizations described in [10] for sparse and dense datasets. We review them briefly for completeness.

Sparse datasets. The main techniques used for sparse datasets can be summarized as follows:

– projection. Tidlists in sparse datasets are characterized by long runs of 0's. When intersecting the tidlists associated with the 2-prefix items belonging to a given candidate itemset, we keep track of such empty elements (words), in order to perform the following intersections faster (a minimal sketch of this idea is given right after this list). This can be considered a sort of raw projection of the vertical dataset, since some transactions, i.e. those corresponding to zero words, are not considered at all during the following tidlist intersections.
– pruning. We remove infrequent items from the dataset. This can result in some transactions remaining empty or with too few items. We therefore remove such transactions (i.e. columns in our bitvector vertical representation) from the dataset. Since this bitwise pruning may be expensive, we only perform it when the benefits introduced are expected to balance its cost.
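The word-skipping projection mentioned in the first item above can be sketched as follows (word width and helper names are illustrative assumptions, not the kDCI code):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    using Tidlist = std::vector<std::uint32_t>;   // bitvector, one bit per transaction

    // Intersect two tidlists and remember which result words are non-zero,
    // so that further intersections can skip the all-zero words.
    Tidlist intersect_and_project(const Tidlist& a, const Tidlist& b,
                                  std::vector<std::size_t>& nonzero_words) {
        Tidlist out(a.size(), 0);
        nonzero_words.clear();
        for (std::size_t w = 0; w < a.size(); ++w) {
            out[w] = a[w] & b[w];
            if (out[w] != 0) nonzero_words.push_back(w);
        }
        return out;
    }

    // Later intersections visit only the recorded non-zero words.
    Tidlist intersect_projected(const Tidlist& acc, const Tidlist& c,
                                const std::vector<std::size_t>& nonzero_words) {
        Tidlist out(acc.size(), 0);
        for (std::size_t w : nonzero_words) out[w] = acc[w] & c[w];
        return out;
    }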
Dense datasets. If the dataset is dense, we expect to deal with strong correlations among the most frequent items. This not only means that the tidlists associated with these most frequent items contain long runs of 1's, but also that they turn out to be very similar. The heuristic technique adopted by DCI, and consequently by kDCI, for dense datasets thus works as follows:

– we reorder the columns of the vertical dataset by moving identical segments of the tidlists associated with the most frequent items to the first consecutive positions;
– since each candidate is likely to include several of these most frequent items, we avoid repeated intersections of identical segments.

The heuristic for density evaluation is applied only once, as soon as the vertical dataset is built. After this decision is taken, we further check whether the counting inference strategy (see Section 2.4) can be profitable or not. The effectiveness of the inference strategy depends on the ratio between the total number of frequent itemsets and how many of them are key patterns. The closer to 1 this ratio is, the less advantage is introduced by the inference strategy. Since this ratio is not known until the computation is finished, we found that the same information can be derived from the average support of the frequent singletons (items) after the first scan. The idea behind this is that if the average support of the single items that survived the first scan is high enough, then longer patterns can be expected to be frequent, and the number of key patterns will more likely be lower than that of frequent itemsets. We experimentally verified that this simple heuristic gives the correct output for all datasets, both real and synthetic.
To summarize the rationale behind kDCI's multiple-strategy approach: if the key-pattern optimization can be adopted, we use the counting inference method, which allows us to avoid many intersections. For the intersections that cannot be avoided, and in the cases where the key-pattern inference method cannot be applied, we further distinguish between sparse and dense datasets, and apply the two strategies explained above.
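The applicability test derived from the average support of the frequent singletons could look like the following minimal sketch; the concrete threshold is not reported in the paper, so the parameter below is purely an illustrative assumption:

    #include <vector>

    // Decide whether to enable key-pattern counting inference, using the
    // average support of the frequent single items after the first scan.
    bool use_key_patterns(const std::vector<unsigned>& singleton_supports, // supports of items in F1
                          unsigned num_transactions,
                          double avg_support_threshold /* illustrative value */) {
        if (singleton_supports.empty() || num_transactions == 0) return false;
        double sum = 0.0;
        for (unsigned s : singleton_supports) sum += s;
        double avg_relative = (sum / singleton_supports.size()) / num_transactions;
        return avg_relative >= avg_support_threshold;
    }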
2.4 Pattern Counting Inference

In this section we describe the counting inference method, which constitutes the most important innovation introduced in kDCI. We exploit a technique inspired by the theoretical results presented in [3], where the PASCAL algorithm was introduced. PASCAL is able to infer the support of an itemset without actually counting its occurrences in the database. In this seminal work, the authors introduced the concept of key pattern (or key itemset). Given a generic pattern Q, it is possible to determine an equivalence class [Q], which contains the set of patterns that have the same support and are included in the same set of database transactions. Moreover, if we define min[P] as the set of the smallest itemsets in [P], a pattern P is a key pattern if P ∈ min[P], i.e. no proper subset of P is in the same equivalence class. Note that we can have several key patterns for each equivalence class. Figure 4 shows an example of a lattice of frequent itemsets, taken from [3], where equivalence classes and key patterns are highlighted.

[Figure 4. Example lattice of frequent itemsets, with equivalence classes and key patterns highlighted.]

Given an equivalence class [P], we can also define a corresponding closed set [12]: the closed set c of [P] is equal to max[P], so that no proper superset of c can belong to the same equivalence class [P].
Among the results illustrated in [3] we have the following important theorems:

Theorem 1. Q is a key pattern iff supp(Q) ≠ min_{p∈Q}(supp(Q \ {p})).

Theorem 2. If P is not a key pattern, and P ⊆ Q, then Q is a non-key pattern as well.

From Theorem 1 it is straightforward to observe that if Q is a non-key pattern, then:
supp(Q) = min_{p∈Q}(supp(Q \ {p})).   (1)

Moreover, Theorem 1 says that we can check whether Q is a key pattern by comparing its support with the minimum support of its proper subsets, i.e. min_{p∈Q}(supp(Q \ {p})). We will show in the following how to use this property to make candidate support counting faster.
Theorems 1 and 2 give the theoretical foundations for the PASCAL algorithm, which finds the support of a non-key k-candidate Q by simply searching for the minimum support among all its (k−1)-subsets. Note that such a search can be performed during the pruning phase of the Apriori candidate generation. DCI does not perform candidate pruning because its intersection technique is comparably faster. For this reason we do not adopt the PASCAL counting inference in kDCI.
The following theorem, partially inspired by the proof of Theorem 2, suggests a faster way to compute the support of a non-key k-candidate Q.
Before introducing the theorem, we need to define the function f, which assigns to each pattern P the set of all the transactions that include this pattern. We can define the support of a pattern in terms of f: supp(P) = |f(P)|. Note that f is monotonically decreasing, i.e. P1 ⊆ P2 ⇒ f(P2) ⊆ f(P1). This is obvious, because every transaction containing P2 surely contains all the subsets of P2.

Theorem 3. If P is a non-key pattern and P ⊆ Q, the following holds:

    f(Q) = f(Q \ (P \ P′)),

where P′ ⊂ P, and P and P′ belong to the same equivalence class, i.e. P, P′ ∈ [P].

Proof. Note that, since P is a non-key pattern, it is surely possible to find a pattern P′, P′ ⊂ P, belonging to the same equivalence class [P].
In order to prove the theorem, we first show that f(Q) ⊆ f(Q \ (P \ P′)) and then that also f(Q) ⊇ f(Q \ (P \ P′)) holds.
The first assertion, f(Q) ⊆ f(Q \ (P \ P′)), holds because (Q \ (P \ P′)) ⊆ Q, and f is monotonically decreasing.
To prove the second assertion, f(Q) ⊇ f(Q \ (P \ P′)), we can rewrite f(Q) as f(Q \ (P \ P′) ∪ (P \ P′)), which is equivalent to f(Q \ (P \ P′)) ∩ f(P \ P′). Since f is decreasing, f(P) ⊆ f(P \ P′). But, since P, P′ ∈ [P], we can write f(P) = f(P′) ⊆ f(P \ P′). Therefore f(Q) = f(Q \ (P \ P′)) ∩ f(P \ P′) ⊇ f(Q \ (P \ P′)) ∩ f(P′). The last inequality is equivalent to f(Q) ⊇ f(Q \ (P \ P′) ∪ P′). Since P′ ⊆ (Q \ (P \ P′)) clearly holds, it follows that f(Q \ (P \ P′) ∪ P′) = f(Q \ (P \ P′)). So we can conclude that f(Q) ⊇ f(Q \ (P \ P′)), which completes the proof. □

The following corollary is trivial, since we defined supp(Q) = |f(Q)|.

Corollary 1. If P is a non-key pattern, and P ⊆ Q, the support of Q can be computed as follows:

    supp(Q) = supp(Q \ (P \ P′)),

where P and P′, P′ ⊂ P, belong to the same equivalence class, i.e. P, P′ ∈ [P].

Finally, we can introduce Corollary 2, which is a particular case of the previous one.

Corollary 2. If Q is a k-candidate (i.e. Q ∈ Ck) and P, P ⊂ Q, is a frequent non-key (k−1)-pattern (i.e. P ∈ Fk−1), there must exist P′ ∈ Fk−2, P′ ⊂ P, such that P and P′ belong to the same equivalence class, i.e. P, P′ ∈ [P], and P and P′ differ by a single item: {pdiff} = P \ P′. The support of Q can thus be computed as:

    supp(Q) = supp(Q \ (P \ P′)) = supp(Q \ {pdiff}).

Corollary 2 says that to find the support of a non-key candidate pattern Q, we can simply check whether Q \ {pdiff} belongs to Fk−1 or not. If Q \ {pdiff} ∈ Fk−1, then Q inherits the same support as Q \ {pdiff} and is therefore frequent. Otherwise Q \ {pdiff} is not frequent, and therefore neither is Q.
Using the theoretical result of Corollary 2, we adopted the following strategy in order to determine the support of a candidate Q at step k.
In kDCI, we store with each itemset P ∈ Fk−1 the following information:

• supp(P);
• a flag indicating whether P is a key pattern or not;
• if P is a non-key pattern, also the item pdiff such that P \ {pdiff} = P′ ∈ [P].

Note that pdiff must be one of the items that we can remove from P to obtain a proper subset P′ of P belonging to the same equivalence class.
During the generation of a generic candidate Q ∈ Ck, as soon as kDCI discovers that one of the subsets of Q, say P, is a non-key pattern, kDCI searches in Fk−1 for the pattern Q \ {pdiff}, where pdiff is the item stored with P.
[Figure 5. Total execution time of OP, FP, Eclatd, and kDCI on the connect, chess, pumsb, and pumsb_star datasets as a function of the support threshold.]
If Q \ {pdiff} is found, then Q is a frequent non-key pattern (see Theorem 2), its support is supp(Q \ {pdiff}), and the item to store with Q is exactly pdiff. In fact, Q′ = Q \ {pdiff} ∈ [Q], i.e. pdiff is one of the items that we can remove from Q to obtain a subset Q′ belonging to the same equivalence class.
The worst case is when all the subsets of Q in Fk−1 are key patterns and the support of Q cannot be inferred from its subsets. In this case kDCI counts the support of Q as usual, and applies Theorem 1 to determine whether Q is a non-key pattern. If Q is a non-key pattern, its support becomes supp(Q) = min_{p∈Q}(supp(Q \ {p})) (see Theorem 1), while the item to be stored with Q is pdiff, i.e. the item to be subtracted from Q to obtain the subset with the minimum support.
The impact of this counting inference technique on the performance of an FSC algorithm becomes evident if one considers the Apriori-like candidate generation strategy adopted by kDCI. From the combination of every pair of itemsets Pa and Pb ∈ Fk−1 that share the same (k−2)-prefix (we call them generators), kDCI generates a candidate k-itemset Q. For very dense datasets, most of the frequent patterns belonging to Fk−1 are non-key patterns. Therefore one or both patterns Pa and Pb used to generate Q ∈ Ck are likely to be non-key patterns. In such cases, in order to find a non-key pattern and then apply Corollary 2, it is not necessary to check the existence of further subsets of Q. For most of the candidates, a single binary search in Fk−1, to look for the pattern Q \ {pdiff}, is thus sufficient to compute supp(Q). Moreover, often Q \ {pdiff} is exactly equal to one of the two (k−1)-itemsets belonging to the generating pair (Pa, Pb): in this case kDCI does not need to perform any search at all to compute supp(Q).
We conclude this section with some examples of how the counting inference technique works. Let us consider Figure 4. Itemset Q = {A, B, E} is a non-key pattern because P = {B, E} is a non-key pattern as well. So, if P′ = {B}, kDCI will store pdiff = E with P. We have that supp({A, B, E}) = supp({A, B, E} \ ({B, E} \ {B})) = supp({A, B, E} \ {pdiff}) = supp({A, B}). From the figure you can see that {A, B, E} and {A, B} both belong to the same equivalence class.
[Figure 6. Total execution time of OP, FP, Eclatd, and kDCI on the mushroom, BMS_View_1, T25I10D10K, and T30I16D400K datasets as a function of the support threshold.]
Another example is itemset Q = {A, B, C, E}, which is generated by the two non-key patterns {A, B, C} and {A, B, E}. Suppose that P = {A, B, C}, i.e. the first generator, while P′ = {A, B}. In this case kDCI will store pdiff = C with P. We have that supp({A, B, C, E}) = supp({A, B, C, E} \ ({A, B, C} \ {A, B})) = supp({A, B, C, E} \ {pdiff}) = supp({A, B, E}), where {A, B, E} is exactly the second generator. In this case, no search is necessary to find {A, B, E}. Looking at the figure, it is possible to verify that {A, B, C, E} and {A, B, E} both belong to the same equivalence class.
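To summarize how this bookkeeping can be organized, the following C++ sketch (data layout and helper names are illustrative assumptions, not the actual kDCI implementation) stores supp(P), the key-pattern flag and pdiff with each frequent (k−1)-itemset, and resolves the support of a candidate Q from Q \ {pdiff} as stated by Corollary 2:

    #include <map>
    #include <optional>
    #include <vector>

    using Itemset = std::vector<int>;   // items kept sorted

    struct PatternInfo {
        unsigned supp = 0;     // supp(P)
        bool is_key = true;    // key-pattern flag
        int pdiff = -1;        // item such that P \ {pdiff} is in the same equivalence class
    };

    // Frequent (k-1)-itemsets with the bookkeeping required by Corollary 2.
    using FrequentMap = std::map<Itemset, PatternInfo>;

    // Returns the inferred support of Q, or no value when Q must be counted
    // explicitly (its generator is a key pattern). A returned support of 0
    // means Q \ {pdiff} is not frequent, so Q cannot be frequent either.
    std::optional<unsigned> infer_support(const Itemset& q,
                                          const PatternInfo& generator_info,
                                          const FrequentMap& f_k_minus_1) {
        if (generator_info.is_key) return std::nullopt;  // fall back to counting
        Itemset reduced;                                  // Q \ {pdiff}
        for (int item : q)
            if (item != generator_info.pdiff) reduced.push_back(item);
        auto it = f_k_minus_1.find(reduced);
        if (it == f_k_minus_1.end()) return 0u;           // infrequent subset: prune Q
        return it->second.supp;                           // Q inherits this support
    }

When the lookup fails, Q has an infrequent subset and can be discarded; when the generator is a key pattern, kDCI falls back to the usual intersection-based counting.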
3 Experimental Results

We experimentally evaluated kDCI performance by comparing its execution time with the original implementations of state-of-the-art FSC algorithms, namely FP-growth (FP) [6], Opportunistic Projection (OP) [7] and Eclat with diffsets (Eclatd) [14], provided by their respective authors.
We used an MS-Windows XP workstation equipped with a Pentium IV 1800 MHz processor, 368 MB of RAM and an EIDE hard disk. For the tests, we used both synthetic and real-world datasets. All the synthetic datasets were created with the IBM dataset generator [1], while all the real-world datasets but one were downloaded from the UCI KDD Archive (http://kdd.ics.uci.edu/). We also extracted a real-world dataset from the TREC WT10g corpus [2]. The original corpus contained about 1.69 million Web documents. The dataset for our tests was built by considering the set of all the terms contained in each document as a transaction. Before generating the transactional dataset, the collection of documents was filtered by removing HTML tags and the most common words (stopwords), and by applying a stemming algorithm. The resulting trec dataset is huge: it is about 1.3 GB and contains 1.69 million short and long transactions, where the maximum length of a transaction is 71,473 items.

kDCI performance and comparisons. Figures 5 and 6 report the total execution time obtained running FP, Eclatd, OP, and kDCI on various datasets as a function of the support threshold s.
On all datasets in Figure 5 (connect, chess, pumsb and pumsb_star), kDCI runs faster than the other algorithms. On pumsb its execution time is very similar to that of OP. For high support thresholds kDCI can drastically prune the dataset and build a compact vertical dataset whose tidlists present large similarities. Such similarity of tidlists is effectively exploited by our strategy for compact datasets. For smaller supports, the benefits introduced by the counting inference strategy become more evident, particularly for the pumsb_star and connect datasets. In these cases the number of frequent itemsets is much higher than the number of key patterns, thus allowing kDCI to drastically reduce the number of intersections needed to determine candidate supports.
On the datasets mushroom and T30I16D400K (see Figure 6), kDCI outperforms the other competitors, and this also holds on the real-world dataset BMS_View_1 when mined with very small support thresholds (see Figure 6). Only on one dataset, namely T25I10D10K, are FP and OP faster than kDCI for all the supports. The reason for this behavior is the size of the candidate set C3, which for this dataset is much larger than F3. While kDCI has to carry out a lot of useless work to determine the support of many candidate itemsets which are not frequent, FP-growth and OP take advantage of the fact that they do not require candidate generation.
Furthermore, differently from FP, Eclatd, and OP, kDCI can efficiently mine huge datasets such as trec and USCensus1990. Figure 7 reports the total execution time required by kDCI to mine these datasets with different support thresholds. The other algorithms failed in mining these datasets due to memory shortage, even when very large support thresholds were used. kDCI, on the other hand, was able to mine such huge datasets since it adapts its behavior to both the size of the dataset and the main memory available.

[Figure 7. Total execution time of kDCI on the datasets trec (a) and USCensus1990 (b) as a function of the support.]
4 Conclusions and Future Work

Due to the complexity of the problem, a good algorithm for FSC has to implement multiple strategies and some level of adaptiveness in order to successfully manage diverse and differently featured inputs.
kDCI uses different approaches for extracting frequent patterns: count-based during the first iterations and intersection-based for the following ones.
Moreover, a new counting inference strategy, together with adaptiveness and resource awareness, constitutes the main innovative feature of the algorithm.
On the basis of the characteristics of the mined dataset, kDCI chooses which optimizations to adopt for reducing the cost of mining at run time. Dataset pruning and effective out-of-core techniques are exploited during the count-based phase, while the intersection-based phase, which starts only when the pruned dataset can fit into main memory, exploits a novel technique based on the notion of key pattern that in many cases allows the support of an itemset to be inferred without any counting.
kDCI also adopts compressed data structures and dynamic type selection to adapt itself to the characteristics of the dataset being mined.
The experimental evaluation demonstrated that kDCI outperforms FP, OP, and Eclatd in most cases. Moreover, differently from the other FSC algorithms tested, kDCI can efficiently manage very large datasets, also on machines with limited physical memory.
Although the variety of datasets used and the large amount of tests conducted permit us to state that the performance of kDCI is not highly influenced by dataset characteristics, and that our optimizations are very effective and general, some further optimizations and future work can reasonably improve kDCI performance. More optimized data structures could be used to store itemset collections, in order to make searches in such collections faster. Note that such fast searches are very important in kDCI, which bases its counting inference technique at level k on searching for frequent itemsets in Fk−1. Finally, we can benefit from a higher level of adaptiveness to the memory available on the machine, either with fully memory-mapped data structures or with out-of-core ones, depending on the data size. This should allow better scalability and wider applicability of the algorithm.

5 Acknowledgments

We acknowledge J. Han, Y. Pan, M.J. Zaki and J. Liu for kindly providing us with the latest versions of their FSC software.
References
[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th VLDB Conf., pages 487–499, 1994.
[2] P. Bailey, N. Craswell, and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing and Management. To appear.
[3] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. ACM SIGKDD Explorations Newsletter, 2(2):66–75, December 2000.
[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA. ACM Press, 1997.
[5] B. Goethals. Efficient Frequent Itemset Mining. PhD thesis, Limburg University, Belgium, 2003.
[6] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1–12, Dallas, Texas, USA, 2000.
[7] J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. In Proc. 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, 2002.
[8] S. Orlando, P. Palmerini, and R. Perego. Enhancing the Apriori Algorithm for Frequent Set Counting. In Proc. of the 3rd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 01), Munich, Germany, volume 2114 of LNCS, pages 71–82. Springer, 2001.
[9] S. Orlando, P. Palmerini, and R. Perego. On Statistical Properties of Transactional Datasets. In 2004 ACM Symposium on Applied Computing (SAC 2004), 2004. To appear.
[10] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and Resource-Aware Mining of Frequent Sets. In Proc. of the 2002 IEEE International Conference on Data Mining (ICDM'02), pages 338–345, 2002.
[11] J. S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data, pages 175–186, 1995.
[12] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science, 1540:398–416, 1999.
[13] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In Proc. of the 2001 IEEE International Conference on Data Mining (ICDM'01), San Jose, CA, USA, 2001.
[14] M. J. Zaki and K. Gouda. Fast Vertical Mining Using Diffsets. In Proc. of the 9th Int. Conf. on Knowledge Discovery and Data Mining, Washington, DC, 2003.