<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Lucchese</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Orlando</string-name>
          <email>orlando@dsi.unive.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Palmerini</string-name>
          <email>p.palmerini@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Perego</string-name>
          <email>r.perego@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <email>silvestri@di.unipi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica Universita` Ca' Foscari di Venezia Venezia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica Universita` di Pisa Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ISTI-CNR Consiglio Nazionale delle Ricerche Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>the minimum support threshold are said to be frequent. The complexity of the FSC problem relies mainly in the potentially explosive growth of its full search space, whose dimension d is, in the worst case, d = P|tmax| |kI| , where tmax is the maximum transack=1 tion length. Taking into account the minimum support threshold, it is possible to reduce the search space, using the well known downward closure relation, which states that an itemset can only be frequent if all its subsets are frequent as well. The exploitation of this property, originally introduced in the Apriori algorithm [1], has transformed a potentially exponentially complex problem, into a more tractable one.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        This paper presents the implementation of kDCI, an
enhancement of DCI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a scalable algorithm for
discovering frequent sets in large databases.
      </p>
      <p>
        The main contribution of kDCI resides on a novel
counting inference strategy, inspired by previously
known results by Basted et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Moreover, multiple
heuristics and efficient data structures are used in
order to adapt the algorithm behavior to the features of
the specific dataset mined and of the computing platform
used.
      </p>
      <p>kDCI turns out to be effective in mining both short
and long patterns from a variety of datasets. We
conducted a wide range of experiments on synthetic and
real-world datasets, both in-core and out-of-core. The
results obtained allow us to state that kDCI
performances are not over-fitted to a special case, and its
high performance is maintained on datasets with
different characteristics.</p>
      <p>Nevertheless, the Apriori property alone is not
sufficient to permit to solve the FSC problem in a
reasonable time, in all cases, i.e. on all possible datasets
and for all possible interesting values of s. Indeed,
another source of complexity in the FSC problem resides
in the dataset internal correlation and statistical
properties, which remain unknown until the mining is
completed. Such diversity in the dataset properties is
reflected in measurable quantities, like the total number
of transactions, or the total number of distinct items |I|
appearing in the database, but also in some other more
fuzzy properties which, although commonly recognized
as important, still lack a formal and univocal definition.
It is the case, for example, of the notion of how dense a
dataset is, i.e. how much its transactions tend to
resemble among one another.</p>
      <p>
        Several important results have been achieved for
specific cases. Dense datasets are effectively mined with
compressed data structure [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], explosion in the
candidates can be avoided using effective projections of the
dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the support of itemsets in compact datasets
can be inferred, without counting, using an equivalence
class based partition of the dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In order to take advantage of all these, and more
specific results, hybrid approaches have been proposed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Critical to this point is when and how to adopt a given
solution instead of another. In lack of a complete
theoretical understanding of the FSC problem, the only
solution is to adopt a heuristic approach, where theoretical
reasoning is supported by direct experience leading to a
strategy that tries to cover a variety of cases as wide as
possible.
      </p>
      <p>
        Starting from the previous DCI (Direct Count &amp;
Intersect) algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] we propose here kDCI, an
enhanced version of DCI that extends its adaptability to the
dataset specific features and the hardware characteristics
of the computing platform used for running the FSC
algorithm. Moreover, in kDCI we introduce a novel
counting inference strategy, based on a new result inspired by
the work of Bastide et al. in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        kDCI is a multiple heuristics hybrid algorithm, able
to adapt its behavior during the execution. Since it
origins from the already published DCI algorithm, we only
outline in this paper how kDCI differs from DCI. A
detailed description of the DCI algorithm can be found
in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>The kDCI algorithm</title>
      <p>
        Several considerations concerning the features of real
datasets, the characteristics of modern hw/sw system, as
well as scalability issues of FSC algorithms have
motivated the design of kDCI. As already pointed out,
transactional databases may have different characteristics in
terms of correlations among the items inside
transactions and of transactions among themselves [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A
desirable feature of an FSC algorithm should be the ability
to adapt its behavior to these characteristics.
      </p>
      <p>Modern hw/sw systems need high locality for
exploiting memory hierarchies effectively and achieving
high performance. Algorithms have to favor the
exploitation of spatial and temporal locality in accessing
in-core and out-core data.</p>
      <p>Scalability is the main concern in designing
algorithms that aim to mine large databases efficiently.
Therefore, it is important to be able to handle datasets
bigger than the available memory.</p>
      <p>We designed and implemented our algorithm kDCI
keeping in mind such performance issues. The pseudo
code of kDCI is given in Algorithm 1.</p>
      <p>kDCI inherits from DCI the level-wise behavior and
the hybrid horizontal-vertical dataset representation. As
computation is started, kDCI maintains the database in
horizontal format and applies an effective pruning
tech</p>
      <sec id="sec-2-1">
        <title>Algorithm 1 kDCI</title>
        <p>Require: D, min supp
// During first scan get optimization figures
F1 = first scan(D, min supp)
// second and following scans on a temporary db D0
F2 = second scan(D0, min supp)
k = 2
while (D0.vertical size() &gt; memory available()) do
k + +
// count-based iteration</p>
        <p>Fk = DCP(D’, min supp, k)
end while
k + +
// count-based iteration + create vertical database VD
Fk = DCP(D’, VD, min supp, k)
dense = V D.is dense())
while (Fk 6= ∅) do
k + +
if (use key patterns()) then
if (dense) then</p>
        <p>Fk = DCI dense keyp(VD, min supp, k)
else</p>
        <p>Fk = DCI sparse keyp(VD, min supp, k)
end if
else
if (dense) then</p>
        <p>Fk = DCI dense(VD, min supp, k)
else</p>
        <p>
          Fk = DCI sparse(VD, min supp, k)
end if
end if
end while
nique to remove infrequent items and short transactions.
A temporary dataset is therefore written to disk at every
iteration. The first steps of the algorithm are described
in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and remain unchanged in kDCI. In kDCI
we only improved memory management by exploiting
compressed and optimized data structures (see Section
2.1 and 2.2).
        </p>
        <p>The effectiveness of pruning is related to the
possibility of storing the dataset in main memory in vertical
format, due to the dataset size reduction. This normally
occurs at the first iterations, depending on the dataset,
the support threshold and the memory available on the
machine, which is determined at run time.</p>
        <p>Once the dataset can be stored in main memory, kDCI
switches to the vertical representation, and applies
several heuristics in order to determine the most effective
strategy for frequent itemset counting.</p>
        <p>
          The most important innovation introduced in kDCI
regards a novel technique to determine the itemset
supports, inspired by the work of Bastide et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. As we
will discuss in Section 2.4, in some cases the support of
candidate itemsets can be determined without actually
counting transactions, but by a faster inference
reasoning.
        </p>
        <p>Moreover, kDCI maintains the different strategies
implemented in DCI for sparse and dense datasets. The
result is a multiple strategy approach: during the
execution kDCI collects statistical information on the dataset
that allows to determine which is the best approach for
the particular case.</p>
        <p>In the following we detail such optimizations and
improvements and the heuristics used to decide which
optimization to use.
2.1</p>
      </sec>
      <sec id="sec-2-2">
        <title>Dynamic data type selection</title>
        <p>The first optimization is concerned with the amount
of memory used to represent itemsets and their counters.
Since such structures are extensively accessed during the
execution of the algorithm, is it profitable to have such
data occupying as little memory as possible. This not
only allows to reduce the spatial complexity of the
algorithm, but also permits low level processor optimizations
to be effective at run time.</p>
        <p>During the first scan of the dataset, global properties
are collected like the total number of distinct frequent
items (m1), the maximum transaction size, and the
support of the most frequent item.</p>
        <p>Once this information is available, we remap the
survived (frequent) items to contiguous integer identifiers.
This allows us to decide the best data type to represent
such identifiers and their counters. For example if the
maximum support of any item is less than 65536, we can
use an unsigned short int to represent the
itemset counters. The same holds for the remapped
identifiers of the items. The decision of which is the most
appropriate type to use for items and counters is taken at
run time, by means of a C++ template-based
implementation of all the kDCI code.</p>
        <p>Before remapping item identifiers, we also reorder
them in increasingly support ordering: more frequent
items are thus assigned larger identifiers. This also
simplifies the intersection-based technique used for dense
datasets (see Section 2.3).
2.2</p>
      </sec>
      <sec id="sec-2-3">
        <title>Compressed data structures</title>
        <p>
          Itemsets are often organized in collections in many
FSC algorithms. Efficient representation of such
collections can lead to important performance improvements.
In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] we pointed out the advantages of storing
candidates in directly accessible data structures for the first
passes of our algorithm. In kDCI we introduce a
compressed representation of an itemset collection, used to
store in the main memory collections of candidate and
prefix
a b c
a b d
b d f
a b c d
a b c e
a b c f
a b c g
a b d h
a b d i
a b d l
a b d m
b d f i
b d f n
index
        </p>
        <p>suffix
frequent itemsets. This representation take advantage
of prefix sharing among the lexicographically ordered
itemsets of the collection.</p>
        <p>The compressed data structure is based on three
arrays (Figure 1). At each iteration k, the first array
(prefix) stores the different prefixes of length k − 1. In the
third array (suffix) all the length-1 suffixes are stored.
Finally, in the element i of the second array (index),
we store the position in the suffix array of the section
of suffixes that share the same prefix. Therefore, when
the itemsets in the collection have to be enumerated, we
first access the prefix array. Then, from the
corresponding entry in the index array we get the section
of suffixes stored in suffix, needed to complete the
itemsets.</p>
        <p>From our tests we can say that, in all the interesting
cases – i.e., when the number of candidate (or frequent)
iemsets explodes – this data structure works well and
achieves up to 30% as compression ratio. For example,
see the results reported in Figure 2.
2.3</p>
      </sec>
      <sec id="sec-2-4">
        <title>Heuristics</title>
        <p>One of the most important features of kDCI is its
ability to adapt its behavior to the dataset specific
characteristics. It is well known that being able to distinguish
between sparse and dense datasets, for example, allows to
adopt specific and effective optimizations. Moreover, as
we will explain the Section 2.4, if the number of frequent
itemsets is much greater than the number of closed
itemsets, it is possible to apply a counting inference
procedure that allows to dramatically reduce the time needed
to determine itemset supports.</p>
        <p>BMS_View, min_supp=0.06%</p>
        <p>non-ccoommpprreesssseedd
4
5
6
8
9
10</p>
        <p>11
i=0..M
f=M/3
t=0..N
p=50%
(b)
d &lt; δ</p>
        <p>SPARSE
(a)</p>
        <p>In kDCI we devised two main heuristics that allow to
distinguish between dense and sparse datasets and to
decide whether to apply the counting inference procedure
or not.</p>
        <p>The first heuristic is simply based on the measure of
the dataset density. Namely, we measure the correlation
among the tidlists corresponding to the most frequent
items. We require that the maximum number of frequent
items for which such correlation is significant, weighted
by the correlation degree itself, is above a given
threshold.</p>
        <p>As an example, consider the two dataset in Figure 3,
where tidlists are placed horizontally, i.e. rows
correspond to items and columns to transactions. Suppose
to choose a density threshold δ = 0.2. If we order the
items according to their support, we have the most dense
region of the dataset at the bottom of each figure.
Starting from the bottom, we find the maximum number of
items whose tidlists have a significant intersection. In
the case of dataset (a), for example, a fraction f = 1/4
of the items share p = 90% of the transactions, leading
to a density of d = f p = 0.25 × 0.9 = 0.23 which is
above the density threshold. For dataset (b) on the other
hand, to a smaller intersection of p = 50% is common
to f = 1/3 of the items. In this last case the density
d = f p = 0.3 × 0.5 = 0.15 is lower than the threshold
and the dataset is considered as sparse. It is worth to
notice that since this notion of density depends on the
minimum support threshold, the same dataset can exhibits
different behaviors when mined with different support
thresholds.</p>
        <p>
          Once the dataset density is determined, we adopted
the same optimizations described in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for sparse and
dense datasets. We review them briefly for
completeness.
        </p>
        <p>Sparse datasets. The main techniques used for
sparse datasets can be summarized as follows:
– projection. Tidlists in sparse datasets are
characterized by long runs of 0’s. When
intersecting the tidlists associated with the
2prefix items belonging to a given candidate
itemset, we keep track of such empty
elements (words), in order to perform the
following intersections faster. This can be
considered as a sort of raw projection of the
vertical dataset, since some transactions, i.e. those
corresponding to zero words, are not
considered at all during the following tidlist
intersections.
– pruning. We remove infrequent items from
the dataset. This can result in some
transaction remaining empty or with too few items.
We therefore remove such transactions (i.e.
columns in the our bitvector vertical
representation) from the dataset. Since this bitwise
pruning may be expensive, we only perform
it when the benefits introduced are expected
to balance its cost.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Dense datasets.</title>
        <p>If the dataset is dense, we expect to deal with
strong correlations among the most frequent items.
This not only means that the tidlists associated
with these most frequent items contain long runs
of 1’s, but also that they turn out to be very similar.
The heuristic technique adopted by DCI and
consequently by kDCI for dense dataset thus works as
follows:
– we reorder the columns of the vertical dataset
by moving identical segments of the tidlists
associated with the most frequent items to the
first consecutive positions;
– since each candidate is likely to include
several of these most frequent items, we avoid
repeated intersections of identical segments.</p>
        <p>The heuristic for density evaluation is applied only
once, as soon as the vertical dataset is built. After this
decision is taken, we further check if the counting
inference strategy (see Section 2.4) can be profitable or not.
The effectiveness of the inference strategy depends on
the ratio between the total number of frequent itemsets
and how many of them are key-patterns. The closer to 1
this ratio is, the less advantage is introduced by the
inference strategy. Since this ratio is not known until the
computation is finished, we found that the same
information can be derived from the average support of the
frequent singletons (items), after the first scan. The idea
behind this is that if the average support of the single
items that survived the first scan is high enough, then
longer patterns can be expected to be frequent and more
likely the number of key-patterns itemsets will be lower
than that of frequent itemsets. We experimentally
verified that this simple heuristic gives the correct output for
all datasets - both real and synthetic.</p>
        <p>To resume the rationale behind kDCI multiple
strategy approach, if the key-patterns optimization can be
adopted, we use the counting inference method that
allows to avoid many intersections. For the intersections
that cannot be avoided and in the cases where the
keypatterns inference method cannot be applied, we further
distinguish between sparse and dense datasets, and
apply the two strategies explained above.
2.4</p>
      </sec>
      <sec id="sec-2-6">
        <title>Pattern Counting Inference</title>
        <p>
          introduced in kDCI. We exploit a technique inspired by
the theoretical results presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], where the
PASCAL algorithm was introduced. PASCAL is able to infers
the support of an itemset without actually count its
occurrences in the database. In this seminal work, the
authors introduced the concept of key pattern (or key
itemset). Given a generic pattern Q, it is possible to
determine an equivalence class [Q], which contains the set of
patterns that have the same support and are included in
the same set of database transactions. Moreover, if we
define min[P ] as the set of the smallest itemsets in [P ],
a pattern P is a key pattern if P ∈ min[P ], i.e. no proper
subset of P is in the same equivalence class. Note that
we can have several key patterns for each equivalence
class. Figure 4 shows an example of a lattice of frequent
itemsets, taken from [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], where equivalence classes and
key patterns are highlighted.
        </p>
        <p>
          Given an equivalence class [P ], we can also define a
corresponding closed set [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: the closed set c of [P ] is
equal to max[P ], so that no proper supersets of c can
belong to the same equivalence class [P ].
        </p>
        <p>
          Among the results illustrated in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] we have the
following important theorems:
Theorem 1 Q is a key pattern iff supp(Q)
minp∈Q(supp(Q \ {p})).
6=
Theorem 2 If P is not a key pattern, and P ⊆ Q, then
Q is a non-key pattern as well.
        </p>
        <p>In this section we describe the count inference
method, which constitute the most important innovation
From Theorem 1 it is straightforward to observe that
if Q is a non-key pattern, then:
supp(Q) = min(supp(Q \ {p})).</p>
        <p>p∈Q
(1)</p>
        <p>Moreover, Theorem 1 says that we can check
whether Q is a key pattern by comparing its support
with the minimum support of its proper subsets, i.e.
minp∈Q(supp(Q \ {p})). We will show in the following
how to use this property to make faster candidate support
counting.</p>
        <p>Theorems 1 and 2 give the theoretical foundations
for the PASCAL algorithm, which finds the support of
a non-key k-candidate Q by simply searching the
minimum supports of all its k − 1 subsets. Note that such
search can be performed during the pruning phase of
the Apriori candidate generation. DCI does not perform
candidate pruning because its intersection technique is
comparably faster. For this reason we will not adopt the
PASCAL counting inference in kDCI.</p>
        <p>The following theorem, partially inspired by the
proof of Theorem 2, suggests a faster way to compute
the support of a non-key k-candidate Q.</p>
        <p>Before introducing the theorem, we need to define
the function f , which assigns to each pattern P the set
of all the transactions that include this pattern. We can
define the support of a pattern in terms of f : supp(P ) =
|f (P )|. Note that f is a monotonically decreasing
function, i.e. if P1 ⊆ P2 ⇒ f (P2) ⊆ f (P1). This is
obvious, because every transaction containing P2 surely
contains all the subsets of P2.</p>
        <p>Theorem 3 If P is a non-key pattern and P ⊆ Q, the
following holds:</p>
        <p>f (Q) = f (Q \ (P \ P 0)).
where P 0 ⊂ P , and P and P 0 belong to the same
equivalence class, i.e. P, P 0 ∈ [P ].</p>
        <p>PROOF. Note that, since P is a non-key pattern, it is
surely possible to find a pattern P 0, P 0 ⊂ P , belonging
to the same equivalence class [P ].</p>
        <p>In order to demonstrate the Theorem we first show
that f (Q) ⊆ f (Q \ (P \ P 0)) and then that also
f (Q) ⊇ f (Q \ (P \ P 0)) holds, thus proving the
Theorem hypotheses.</p>
        <p>The first assertion f (Q) ⊆ f (Q \ (P \ P 0)) holds
because (Q \ (P \ P 0)) ⊆ Q, and f is a monotonically
decreasing function.</p>
        <p>To prove the second assertion, f (Q) ⊇ f (Q \ (P \
P 0)), we can rewrite f (Q) as f (Q \ (P \ P 0) ∪ (P \ P 0)),
which is equivalent to f (Q \ (P \ P 0)) ∩ f (P \ P 0).</p>
        <p>Since f is decreasing, f (P ) ⊆ f (P \ P 0). But, since
P, P 0 ∈ [P ], then we can write f (P ) = f (P 0) ⊆ f (P \
P 0). Therefore f (Q) = f (Q \ (P \ P 0)) ∩ f (P \ P 0) ⊇
f (Q\(P \P 0))∩f (P 0). The last inequality is equivalent
to f (Q) ⊇ f (Q \ (P \ P 0) ∪ P 0). Since P 0 ⊆ (Q \ (P \
P 0)) clearly holds, it follows that f (Q\(P \P 0)∪P 0) =
f (Q \ (P \ P 0)). So we can conclude that f (Q) ⊇ f (Q \
(P \ P 0)), which completes the proof.
2</p>
        <p>The following corollary is trivial, since we defined
supp(Q) = |f (Q)|.</p>
        <p>Corollary 1 If P is a non-key pattern, and P ⊆ Q, the
support of Q can be computed as follows:</p>
        <p>supp(Q) = supp(Q \ (P \ P 0))
where P 0 and P , P 0 ⊂ P , belong to the same
equivalence class, i.e. P, P 0 ∈ [P ].</p>
        <p>Finally, we can introduce Corollary 2, which is a
particular case of the previous one.</p>
        <p>Corollary 2 If Q is k-candidate (i.e. Q ∈ Ck) and
P , P ⊂ Q, is a frequent non-key (k-1)-pattern (i.e.
P ∈ Fk−1), there must exist P 0 ∈ Fk−2, P 0 ⊂ P , such
that P and P 0 belong to the same equivalence class,
i.e. P, P 0 ∈ [P ] and P and P 0 differ for a single item:
{pdiff} = P \ P 0. The support of Q can thus be
computed as:
supp(Q) = supp(Q \ (P \ P 0)) = supp(Q \ {pdiff})</p>
        <p>Corollary 2 says that to find the support of a
nonkey candidate pattern Q, we can simply check whether
Q \ {pdiff} belongs to Fk−1, or not. If Q \ {pdiff} ∈
Fk−1, then Q inherits the same support as Q \ {pdiff}
and is therefore frequent. Otherwise we can conclude
that Q \ {pdiff} is not frequent.</p>
        <p>Using the theoretical result of Corollary 2, we
adopted the following strategy in order to determine the
support of a candidate Q at step k.</p>
        <p>In kDCI, we store with each itemset P ∈ Fk−1 the
following information:
• supp(P );
• a flag indicating if P is a key pattern or not;
• if P is non-key pattern, also the item pdiff such that</p>
        <p>P \ {pdiff } = P 0 ∈ [P ].</p>
        <p>Note that pdiff must be one of the items that we can
remove from P to obtain a proper subset P 0 of P ,
belonging to the same equivalence class.</p>
        <p>During the generation of a generic candidate Q ∈ Ck,
as soon as kDCI discovers that one of the subsets of Q,</p>
        <p>0.1
100
)
c
e
(s 100
e
m
i
T
iton 10
u
c
e
x
ltaE 1
o
T
0.1
65
70
75
85
90
95
25
30
45
50
80
Support (%)
35 40
Support (%)
say P , is a non-key pattern, kDCI searches in Fk−1 the
pattern Q \ {pdiff }, where pdiff is stored with P .</p>
        <p>If Q \ {pdiff } is found, then Q is a frequent
nonkey pattern (see Theorem 2), its support is supp(Q \
{pdiff }), and the item to store with Q is exactly pdiff.
In fact, Q0 = Q \ {pdiff } ∈ [Q], i.e. pdiff is one of the
items that we can remove from Q to obtain a subset Q0
belonging to the same equivalence class.</p>
        <p>The worst case is when all the subsets of Q in Fk−1
are key patterns and the support of Q cannot be inferred
from its subsets. In this case kDCI counts the support
of Q as usual, and applies Theorem 1 to determine if
Q is a non-key-pattern. If Q is a non-key-pattern, its
support becomes supp(Q) = minp∈Q(supp(Q \ {p}))
(see Theorem 1), while the item to be stored with Q is
pdiff, i.e. the item to be subtracted from Q to obtain the
pattern with the minimum support.</p>
        <p>The impact of this counting inference technique on
the performance of an FSC algorithm becomes evident if
you consider the Apriori-like candidate generation
strategy adopted by kDCI. From the combination of every
pair of itemsets Pa and Pb ∈ Fk−1, that share the same
(k-2)-prefix (we called them generators), kDCI
generates a candidate k-itemset Q. For very dense datasets,
most of the frequent patterns belonging to Fk−1 are
nonkey patterns. Therefore one or both patterns Pa and Pb
used to generate Q ∈ Ck are likely to be non-key
patterns. In such cases, in order to find a non-key pattern
and then apply Corollary 2, it is not necessary to check
the existence of further subsets of Q. For most of the
candidates, a single binary search in Fk−1, to look for
the pattern Q \ {pdiff }, is thus sufficient to compute
supp(Q). Moreover, often Q \ {pdiff } is exactly equal
to one of the two k-1-itemsets belonging to the
generating pair (Pa, Pb): in this case kDCI does not need to
perform any search at all to compute supp(Q).</p>
        <p>We conclude this section with some examples of how
the counting inference technique works. Let us
consider Figure 4. Itemset Q = {A, B, E} is a non-key
pattern because P = {B, E} is a non-key pattern as
well. So, if P 0 = {B}, kDCI will store pdiff =
E with P . We have that the supp({A, B, E}) =
supp({A, B, E}\({B, E}\{B})) = supp({A, B, E}\
{pdiff } = supp({A, B}). From the Figure you can
see that {A, B, E} and {A, B} both belong to the same
equivalence class.</p>
        <p>Another example is itemset Q = {A, B, C, E}, that
is generated by the two non-key patterns {A, B, C}
and {A, B, E}. Suppose that P = {A, B, C}, i.e.
the first generator, while P 0 = {A, B}. In this
case kDCI will store pdiff = C with P . We have
that the supp({A, B, C, E}) = supp({A, B, C, E} \
({A, B, C} \ {A, B})) = supp({A, B, C, E} \
{pdiff } = supp({A, B, E}, where {A, B, E} is exactly
the second generator. In this case, no search is necessary
to find {A, B, E}. Looking at the Figure, it is possible
to verify that {A, B, C, E} and {A, B, E} both belong
to the same equivalence class.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results</title>
      <p>
        We experimentally evaluated kDCI performances by
comparing its execution time with respect to the
original implementations of state of the art FSC algorithms,
namely FP-growth (FP) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Opportunistic Projection
(OP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Eclat with diffsets (Eclatd) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], provided
by their respective authors.
      </p>
      <p>
        We used a MS-WindowsXP workstation equipped
with a Pentium IV 1800 MHz processor, 368MB of
RAM memory and an eide hard disk. For the tests,
we used both synthetic and real-world datasets. All
the synthetic datasets used were created with the IBM
dataset generator [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], while all the real-world datasets
but one were downloaded from the UCI KDD Archive
(http://kdd.ics.uci.edu/). We also extracted
a real-world dataset from the TREC WT10g corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The original corpus contained about 1.69 millions of
WEB documents. The dataset for our tests was built by
considering the set of all the terms contained in each
document as a transaction. Before generating the
transactional dataset, the collection of documents was filtered
by removing HTML tags and the most common words
(stopwords), and by applying a stemming algorithm. The
resulting trec dataset is huge. It is about 1.3GB, and
contains 1.69 millions of short and long transactions,
where the maximum length of a transaction is 71, 473
items.
kDCI performance and comparisons. Figure 5 and
6 report the total execution time obtained running FP,
Eclatd, OP, and kDCI on various datasets as a
function of the support threshold s. On all datasets in
Figure 5, connect, chess pumsb and pumsb star,
kDCI runs faster than the others algorithms. On pumsb
its execution time is very similar to the one of OP. For
high support thresholds kDCI can drastically prune the
dataset, and build a compact vertical dataset, whose
tidlists presents large similarities. Such similarity of
tidlists is effectively exploited by our strategy for
compact datasets. For smaller supports, the benefits
introduced by the counting inference strategy become more
evident, particularly for the pumsb star and
connect datasets. In these cases the number of frequent
itemsets is much higher than the number of key-patterns,
thus allowing kDCI to drastically reduce the number of
intersections needed to determine candidate supports.
      </p>
      <p>On the datasets mushroom and T30I16D400K
(see Figure 6), kDCI outperforms the other
competitors, and this also holds on the real-world dataset
BMS View 1 when mined with very small support
thresholds (see Figure 6). On only a dataset, namely
T25I10D10K, FP and OP are faster than kDCI for all
the supports. The reason of this behavior is the size
of the candidate set C3, which for this dataset is much
larger than F3. While kDCI has to carry out a lot of
useless work to determine the support of many candidate
itemsets which are not frequent, FP-growth and OP take
advantage of the fact that they do not require candidate
generation.</p>
      <p>Furthermore, differently from FP, Eclatd, and OP,
kDCI can efficiently mine huge datasets such as trec
and USCensus1990. Figure 7 reports the total
execution time required by kDCI to mine these datasets with
different support thresholds. The other algorithms failed
in mining these datasets due to memory shortage, also
when very large support thresholds were used. On the
other hand, kDCI was able to mine such huge datasets
since it adapts its behavior to both the size of the dataset
and the main memory available.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>Due to the complexity of the problem, a good
algorithm for FSC has to implement multiple strategies and
some level of adaptiveness in order to be able to
succesfully manage diverse and differently featured inputs.</p>
      <p>kDCI uses different approaches for extracting
frequent patterns: count-based during the first iterations
and intersection-based for the following ones.</p>
      <p>Moreover, a new counting inference strategy,
together with, adaptiveness and resource awareness are the
main innovative features of the algorithm.</p>
      <p>On the basis of the characteristics of the mined
dataset, kDCI chooses which optimization to adopt for
reducing the cost of mining at run–time. Dataset pruning
and effective out-of-core techniques are exploited
during the count-based phase, while the intersection-based
phase, which starts only when the pruned dataset can fit
into the main memory, exploits a novel technique based
on the notion of key-pattern that in many cases allows to
infer the support of an itemset without any counting.</p>
      <p>kDCI also adopts compressed data structures and
dynamic type selection to adapt itself to the characteristics
of the dataset being mined.</p>
      <p>The experimental evaluation demonstrated that kDCI
outperforms FP, OP, and Eclatd in most cases.
Moreover, differently from the other FSC algorithms tested,
kDCI can efficiently manage very large datasets, also on
machines with limited physical memory.</p>
      <p>Although the variety of datasets used and the large
amount of tests conducted permit us to state that the
performance of kDCI is not highly influenced by dataset
characteristics, and that our optimizations are very
effective and general, some further optimizations and future
work will reasonably improve kDCI performance. More
optimized data structures could be used to store itemset
collections in order to make faster searches in such
collections. Note that such fast searches are very important
in kDCI, which bases its count inference technique at
level k on searching for frequent itemset in Fk−1.
Finally, we can benefit from a higher level of
adaptiveness to the available memory on the machine, either
with fully memory mapped data structures or with
outof-core ones, depending on the data size. This should
allow a better scalability and a wider applicability of the
algorithm.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We acknowledge J. Han, Y. Pan, M.J. Zaki and J. Liu
for kindly providing us the latest versions of their FSC
software.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Fast Algorithms for Mining Association Rules in Large Databases</article-title>
          .
          <source>In Proc. of the 20th VLDB Conf.</source>
          , pages
          <fpage>487</fpage>
          -
          <lpage>499</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <article-title>Engineering a multi-purpose test collection for Web retrieval experiments</article-title>
          .
          <source>Information Processing and Management</source>
          . to appear.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bastide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taouil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pasquier</surname>
          </string-name>
          , G. Stumme, and
          <string-name>
            <given-names>L.</given-names>
            <surname>Lakhal</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns with counting inference</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>66</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>December 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsur</surname>
          </string-name>
          .
          <article-title>Dynamic itemset counting and implication rules for market basket data</article-title>
          . In J. Peckham, editor,
          <source>SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15</source>
          ,
          <year>1997</year>
          , Tucson, Arizona, USA. ACM Press,
          <volume>05</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          .
          <article-title>Efficient Frequent Itemset Mining</article-title>
          .
          <source>PhD thesis</source>
          , Limburg University, Belgium,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          . Pei, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Mining Frequent Patterns without Candidate Generation</article-title>
          .
          <source>In Proc. of the ACM SIGMOD Int. Conf. on Management of Data</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          , Dallas, Texas, USA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          .
          <article-title>Mining Frequent Item Sets by Opportunistic Projection</article-title>
          .
          <source>In Proc. 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02)</source>
          , Edmonton, Canada,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Palmerini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          .
          <article-title>Enhancing the Apriori Algorithm for Frequent Set Counting</article-title>
          .
          <source>In Proc. of 3rd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK</source>
          <volume>01</volume>
          ) - Munich, Germany, volume
          <volume>2114</volume>
          <source>of LNCS</source>
          , pages
          <fpage>71</fpage>
          -
          <lpage>82</lpage>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Palmerini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          .
          <article-title>On Statistical Properties of Transactional Datasets</article-title>
          .
          <source>In 2004 ACM Symposium on Applied Computing (SAC</source>
          <year>2004</year>
          ),
          <year>2004</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Palmerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Adaptive and Resource-Aware Mining of Frequent Sets</article-title>
          .
          <source>In Proc. The 2002 IEEE International Conference on Data Mining (ICDM'02)</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>An Effective Hash Based Algorithm for Mining Association Rules</article-title>
          .
          <source>In Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pasquier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bastide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taouil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Lakhal</surname>
          </string-name>
          .
          <article-title>Discovering frequent closed itemsets for association rules</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          ,
          <volume>1540</volume>
          :
          <fpage>398</fpage>
          -
          <lpage>416</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nishio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            amd
            <surname>Yang. H-Mine</surname>
          </string-name>
          :
          <article-title>Hyper-Structure Mining of Frequent Patterns in Large Databases</article-title>
          .
          <source>In Proc. The 2001 IEEE International Conference on Data Mining (ICDM'01)</source>
          , San Jose, CA, USA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Gouda</surname>
          </string-name>
          .
          <article-title>Fast Vertical Mining Using Diffsets</article-title>
          .
          <source>In 9th Int. Conf. on Knowledge Discovery and Data Mining</source>
          , Washington, DC,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>