                  kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets

                   Claudio Lucchese(1), Salvatore Orlando(1), Paolo Palmerini(1,2),
                            Raffaele Perego(2), Fabrizio Silvestri(2,3)

(1) Dipartimento di Informatica, Università Ca' Foscari di Venezia, Venezia, Italy
    {orlando,clucches}@dsi.unive.it
(2) ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy
    {r.perego,p.palmerini}@isti.cnr.it
(3) Dipartimento di Informatica, Università di Pisa, Pisa, Italy
    silvestri@di.unipi.it




Abstract

This paper presents the implementation of kDCI, an enhancement of DCI [10], a scalable algorithm for discovering frequent sets in large databases.

The main contribution of kDCI resides in a novel counting inference strategy, inspired by previously known results by Bastide et al. [3]. Moreover, multiple heuristics and efficient data structures are used in order to adapt the algorithm behavior to the features of the specific dataset mined and of the computing platform used.

kDCI turns out to be effective in mining both short and long patterns from a variety of datasets. We conducted a wide range of experiments on synthetic and real-world datasets, both in-core and out-of-core. The results obtained allow us to state that the performance of kDCI is not over-fitted to a special case, and that its high performance is maintained on datasets with different characteristics.

1   Introduction

Despite the considerable number of algorithms proposed in the last decade for solving the problem of finding frequent patterns in transactional databases (among the many we mention [1] [11] [6] [13] [14] [4] [3] [7]), a single best approach still has to be found.

The Frequent Set Counting (FSC) problem consists in finding all the sets of items (itemsets) which occur in at least s% (s is called the support) of the transactions of a database D, where each transaction is a variable-length collection of items from a set I. Itemsets which verify the minimum support threshold are said to be frequent.

The complexity of the FSC problem lies mainly in the potentially explosive growth of its full search space, whose dimension d is, in the worst case,

    d = Σ_{k=1}^{|t_max|} C(|I|, k),

where C(n, k) denotes the binomial coefficient and t_max is the maximum transaction length. Taking into account the minimum support threshold, it is possible to reduce the search space using the well-known downward closure relation, which states that an itemset can only be frequent if all its subsets are frequent as well. The exploitation of this property, originally introduced in the Apriori algorithm [1], has transformed a potentially exponentially complex problem into a more tractable one.

Nevertheless, the Apriori property alone is not sufficient to solve the FSC problem in a reasonable time in all cases, i.e. on all possible datasets and for all possible interesting values of s. Indeed, another source of complexity in the FSC problem resides in the internal correlation and statistical properties of the dataset, which remain unknown until the mining is completed. Such diversity in dataset properties is reflected in measurable quantities, like the total number of transactions or the total number of distinct items |I| appearing in the database, but also in other, fuzzier properties which, although commonly recognized as important, still lack a formal and univocal definition. It is the case, for example, of the notion of how dense a dataset is, i.e. how much its transactions tend to resemble one another.

Several important results have been achieved for specific cases. Dense datasets are effectively mined with compressed data structures [14], explosion in the number of candidates can be avoided using effective projections of the dataset [7], and the support of itemsets in compact datasets
can be inferred, without counting, using an equivalence-class-based partition of the dataset [3].

In order to take advantage of all these, and of more specific results, hybrid approaches have been proposed [5]. The critical point is when and how to adopt a given solution instead of another. Lacking a complete theoretical understanding of the FSC problem, the only solution is to adopt a heuristic approach, where theoretical reasoning is supported by direct experience, leading to a strategy that tries to cover as wide a variety of cases as possible.

Starting from the previous DCI (Direct Count & Intersect) algorithm [10], we propose here kDCI, an enhanced version of DCI that extends its adaptability to the dataset-specific features and to the hardware characteristics of the computing platform used for running the FSC algorithm. Moreover, in kDCI we introduce a novel counting inference strategy, based on a new result inspired by the work of Bastide et al. [3].

kDCI is a multiple-heuristics hybrid algorithm, able to adapt its behavior during the execution. Since it originates from the already published DCI algorithm, in this paper we only outline how kDCI differs from DCI. A detailed description of the DCI algorithm can be found in [10].

2   The kDCI algorithm

Several considerations concerning the features of real datasets, the characteristics of modern hw/sw systems, as well as scalability issues of FSC algorithms, have motivated the design of kDCI. As already pointed out, transactional databases may have different characteristics in terms of correlations among the items inside transactions and of transactions among themselves [9]. A desirable feature of an FSC algorithm is the ability to adapt its behavior to these characteristics.

Modern hw/sw systems need high locality for exploiting memory hierarchies effectively and achieving high performance. Algorithms have to favor the exploitation of spatial and temporal locality in accessing in-core and out-of-core data.

Scalability is the main concern in designing algorithms that aim to mine large databases efficiently. Therefore, it is important to be able to handle datasets bigger than the available memory.

We designed and implemented our algorithm kDCI keeping in mind such performance issues. The pseudo code of kDCI is given in Algorithm 1.

Algorithm 1 kDCI
Require: D, min_supp
  // During the first scan, gather optimization figures
  F1 = first_scan(D, min_supp)
  // Second and following scans work on a temporary db D'
  F2 = second_scan(D', min_supp)
  k = 2
  while (D'.vertical_size() > memory_available()) do
    k++
    // count-based iteration
    Fk = DCP(D', min_supp, k)
  end while
  k++
  // count-based iteration + create vertical database VD
  Fk = DCP(D', VD, min_supp, k)
  dense = VD.is_dense()
  while (Fk ≠ ∅) do
    k++
    if (use_key_patterns()) then
      if (dense) then
        Fk = DCI_dense_keyp(VD, min_supp, k)
      else
        Fk = DCI_sparse_keyp(VD, min_supp, k)
      end if
    else
      if (dense) then
        Fk = DCI_dense(VD, min_supp, k)
      else
        Fk = DCI_sparse(VD, min_supp, k)
      end if
    end if
  end while

kDCI inherits from DCI the level-wise behavior and the hybrid horizontal-vertical dataset representation. As the computation starts, kDCI maintains the database in horizontal format and applies an effective pruning technique to remove infrequent items and short transactions. A temporary dataset is therefore written to disk at every iteration. The first steps of the algorithm are described in [8] and [10] and remain unchanged in kDCI. In kDCI we only improved memory management by exploiting compressed and optimized data structures (see Sections 2.1 and 2.2).

The effectiveness of pruning is related to the possibility of storing the dataset in main memory in vertical format, thanks to the dataset size reduction. This normally occurs in the first iterations, depending on the dataset, the support threshold and the memory available on the machine, which is determined at run time.

Once the dataset can be stored in main memory, kDCI switches to the vertical representation, and applies several heuristics in order to determine the most effective strategy for frequent itemset counting.

The most important innovation introduced in kDCI regards a novel technique to determine itemset supports, inspired by the work of Bastide et al. [3]. As we will discuss in Section 2.4, in some cases the support of candidate itemsets can be determined without actually counting transactions, but by a faster inference reasoning.

Moreover, kDCI maintains the different strategies implemented in DCI for sparse and dense datasets. The result is a multiple-strategy approach: during the execution kDCI collects statistical information on the dataset that allows it to determine the best approach for the particular case.

In the following we detail such optimizations and improvements, and the heuristics used to decide which optimization to use.

2.1   Dynamic data type selection

The first optimization is concerned with the amount of memory used to represent itemsets and their counters. Since such structures are extensively accessed during the execution of the algorithm, it is profitable to have such data occupy as little memory as possible. This not only reduces the spatial complexity of the algorithm, but also permits low-level processor optimizations to be effective at run time.

During the first scan of the dataset, global properties are collected, like the total number of distinct frequent items (m1), the maximum transaction size, and the support of the most frequent item.

Once this information is available, we remap the surviving (frequent) items to contiguous integer identifiers. This allows us to decide the best data type to represent such identifiers and their counters. For example, if the maximum support of any item is less than 65536, we can use an unsigned short int to represent the itemset counters. The same holds for the remapped identifiers of the items. The decision of the most appropriate type to use for items and counters is taken at run time, by means of a C++ template-based implementation of all the kDCI code.

Before remapping item identifiers, we also reorder them in increasing support order: more frequent items are thus assigned larger identifiers. This also simplifies the intersection-based technique used for dense datasets (see Section 2.3).
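
To illustrate, a minimal C++ sketch of this run-time type selection follows. This is our own illustration, not the actual kDCI source: the function names (mine, mine_dispatch) and the dispatch policy are hypothetical, and std::uint16_t plays the role of the unsigned short int mentioned above.

#include <cstdint>
#include <iostream>
#include <vector>

// Core mining routine, instantiated once per (item type, counter type) pair.
// ItemT holds remapped item identifiers, CounterT holds support counters.
template <typename ItemT, typename CounterT>
void mine(std::size_t num_items) {
    std::vector<ItemT>    items(num_items);     // compact item identifiers
    std::vector<CounterT> counters(num_items);  // per-candidate support counters
    std::cout << "item: " << sizeof(ItemT) << " bytes, counter: "
              << sizeof(CounterT) << " bytes\n";
    // ... the level-wise mining iterations would run here ...
}

// Run-time dispatch: after the first scan we know the number of frequent
// items and the maximum item support, so we can pick the narrowest types.
void mine_dispatch(std::size_t num_frequent_items, std::uint64_t max_supp) {
    const bool small_items = num_frequent_items < 65536;  // fits a 16-bit id
    const bool small_supp  = max_supp           < 65536;  // fits a 16-bit counter
    if (small_items && small_supp)
        mine<std::uint16_t, std::uint16_t>(num_frequent_items);
    else if (small_items)
        mine<std::uint16_t, std::uint32_t>(num_frequent_items);
    else
        mine<std::uint32_t, std::uint32_t>(num_frequent_items);
}

Since every combination of types is compiled ahead of time, the inner loops operate on the narrowest integers without any per-access branching.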
2.2   Compressed data structures

Itemsets are often organized in collections in many FSC algorithms. An efficient representation of such collections can lead to important performance improvements. In [8] we pointed out the advantages of storing candidates in directly accessible data structures for the first passes of our algorithm. In kDCI we introduce a compressed representation of an itemset collection, used to store in main memory collections of candidate and frequent itemsets. This representation takes advantage of prefix sharing among the lexicographically ordered itemsets of the collection.

The compressed data structure is based on three arrays (Figure 1). At each iteration k, the first array (prefix) stores the distinct prefixes of length k - 1. In the third array (suffix) all the length-1 suffixes are stored. Finally, in element i of the second array (index), we store the position in the suffix array of the section of suffixes that share the same prefix. Therefore, when the itemsets in the collection have to be enumerated, we first access the prefix array. Then, from the corresponding entry in the index array, we get the section of suffixes stored in suffix that is needed to complete the itemsets.

[Figure 1. Compressed data structure used for itemset collections can reduce the amount of memory needed to store the itemsets. In the example, ten 4-itemsets ({a,b,c,d}, {a,b,c,e}, ..., {b,d,f,n}) are stored with three prefixes (a b c, a b d, b d f), three index entries (3, 7, 9, marking the end of each prefix's section) and ten suffixes, for a total of 9 + 3 + 10 = 21 memory cells instead of the 4 x 10 = 40 cells of the non-compressed representation.]

From our tests we can say that, in all the interesting cases, i.e. when the number of candidate (or frequent) itemsets explodes, this data structure works well and achieves a compression ratio of up to 30%. For example, see the results reported in Figure 2.
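
The following self-contained C++ sketch (ours, with hypothetical names; not the kDCI code) illustrates the three-array scheme on the example of Figure 1, storing the ten 4-itemsets in 21 cells instead of 40:

#include <cstdint>
#include <iostream>
#include <vector>

using Item = std::uint16_t;  // remapped item identifier (see Section 2.1)

// Compressed collection of k-itemsets sharing (k-1)-prefixes.
struct CompressedItemsets {
    std::size_t k;                   // itemset length
    std::vector<Item> prefix;        // concatenated (k-1)-prefixes
    std::vector<std::size_t> index;  // last suffix position of each prefix
    std::vector<Item> suffix;        // length-1 suffixes

    // Enumerate all itemsets: prefix i owns the suffix section that ends
    // at index[i] and starts right after the previous prefix's section.
    void enumerate() const {
        std::size_t begin = 0;
        for (std::size_t i = 0; i < index.size(); ++i) {
            for (std::size_t s = begin; s <= index[i]; ++s) {
                for (std::size_t j = 0; j < k - 1; ++j)
                    std::cout << prefix[i * (k - 1) + j] << ' ';
                std::cout << suffix[s] << '\n';
            }
            begin = index[i] + 1;
        }
    }
};

int main() {
    // The ten 4-itemsets of Figure 1, with a=0, b=1, c=2, d=3, ..., n=11.
    CompressedItemsets c{
        4,                                   // k = 4
        {0, 1, 2,  0, 1, 3,  1, 3, 5},       // prefixes: abc, abd, bdf
        {3, 7, 9},                           // sections end at 3, 7, 9
        {3, 4, 5, 6,  7, 8, 9, 10,  8, 11}   // suffixes: defg, hilm, in
    };
    c.enumerate();                           // prints the ten 4-itemsets
}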
2.3   Heuristics

One of the most important features of kDCI is its ability to adapt its behavior to the specific characteristics of the dataset. It is well known that being able to distinguish between sparse and dense datasets, for example, allows specific and effective optimizations to be adopted. Moreover, as we will explain in Section 2.4, if the number of frequent itemsets is much greater than the number of closed itemsets, it is possible to apply a counting inference procedure that dramatically reduces the time needed to determine itemset supports.

[Figure 2. Memory usage (KBytes) per iteration k with the compressed itemset collection representation versus the non-compressed one: (a) BMS_View with min_supp=0.06%, (b) connect with min_supp=80%.]

[Figure 3. Heuristic to establish dataset density or sparsity, with density threshold δ = 0.2: (a) a fraction f = 1/4 of the items shares p = 90% of the transactions, d = f·p = 0.25 × 0.9 ≈ 0.23 > δ (dense); (b) a fraction f = 1/3 shares only p = 50% of the transactions, d ≈ 0.3 × 0.5 = 0.15 < δ (sparse).]

In kDCI we devised two main heuristics that allow us to distinguish between dense and sparse datasets, and to decide whether or not to apply the counting inference procedure.

The first heuristic is simply based on a measure of the dataset density. Namely, we measure the correlation among the tidlists corresponding to the most frequent items. We require that the maximum number of frequent items for which such correlation is significant, weighted by the correlation degree itself, is above a given threshold.

As an example, consider the two datasets in Figure 3, where tidlists are placed horizontally, i.e. rows correspond to items and columns to transactions. Suppose we choose a density threshold δ = 0.2. If we order the items according to their support, the densest region of the dataset lies at the bottom of each figure. Starting from the bottom, we find the maximum number of items whose tidlists have a significant intersection. In the case of dataset (a), for example, a fraction f = 1/4 of the items shares p = 90% of the transactions, leading to a density of d = f·p = 0.25 × 0.9 ≈ 0.23, which is above the density threshold. For dataset (b), on the other hand, a smaller intersection of p = 50% is common to a fraction f = 1/3 of the items. In this last case the density d = f·p ≈ 0.3 × 0.5 = 0.15 is lower than the threshold, and the dataset is considered sparse. It is worth noticing that, since this notion of density depends on the minimum support threshold, the same dataset can exhibit different behaviors when mined with different support thresholds.
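How such a density measure can be computed is sketched below in C++ (our own simplified interpretation, not the kDCI source; is_dense and its parameters are hypothetical). Starting from the most frequent item, it maintains the running intersection of the tidlists and keeps the best correlation degree weighted by the fraction of items involved:

#include <cstddef>
#include <vector>

// Vertical dataset: tidlists[i] is the bitvector of item i over the
// transactions, with items sorted by increasing support.
using Tidlist = std::vector<bool>;

// Returns true if the dataset is judged dense: some group of the most
// frequent items shares a large enough portion of the transactions.
bool is_dense(const std::vector<Tidlist>& tidlists,
              std::size_t num_transactions, double delta) {
    const std::size_t M = tidlists.size();
    double best = 0.0;
    // Running intersection of the f most frequent items' tidlists.
    Tidlist common(num_transactions, true);
    for (std::size_t f = 1; f <= M; ++f) {
        const Tidlist& t = tidlists[M - f];  // next most frequent item
        std::size_t shared = 0;
        for (std::size_t j = 0; j < num_transactions; ++j) {
            common[j] = common[j] && t[j];
            shared += common[j];
        }
        double p = static_cast<double>(shared) / num_transactions;
        double d = p * (static_cast<double>(f) / M);  // weighted correlation
        if (d > best) best = d;
    }
    return best > delta;  // e.g. delta = 0.2 as in Figure 3
}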

Once the dataset density is determined, we adopt the same optimizations described in [10] for sparse and dense datasets. We review them briefly for completeness.

Sparse datasets. The main techniques used for sparse datasets can be summarized as follows:

  – projection. Tidlists in sparse datasets are characterized by long runs of 0's. When intersecting the tidlists associated with the 2-prefix items belonging to a given candidate itemset, we keep track of the empty elements (words), in order to perform the following intersections faster (see the sketch after this list). This can be considered a sort of raw projection of the vertical dataset, since some transactions, i.e. those corresponding to zero words, are not considered at all during the following tidlist intersections.

  – pruning. We remove infrequent items from the dataset. This can leave some transactions empty or with too few items; we therefore remove such transactions (i.e. columns in our bitvector vertical representation) from the dataset. Since this bitwise pruning may be expensive, we only perform it when the benefits introduced are expected to balance its cost.
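
As referenced in the projection item above, here is a rough C++ sketch of the idea (our own, under simplifying assumptions; not the kDCI implementation): the first intersection for a 2-item prefix records which words of the result are zero, and subsequent intersections sharing that prefix visit only the surviving words:

#include <cstdint>
#include <vector>

using Word = std::uint64_t;  // tidlist stored as a packed bitvector

// Intersect two tidlists and record the indices of non-zero result words.
std::vector<Word> intersect_and_project(const std::vector<Word>& a,
                                        const std::vector<Word>& b,
                                        std::vector<std::size_t>& live) {
    std::vector<Word> out(a.size(), 0);
    live.clear();
    for (std::size_t w = 0; w < a.size(); ++w) {
        out[w] = a[w] & b[w];
        if (out[w] != 0) live.push_back(w);  // remember non-empty words only
    }
    return out;
}

// Later intersections with the same 2-item prefix skip the zero words.
std::vector<Word> intersect_projected(const std::vector<Word>& acc,
                                      const std::vector<Word>& c,
                                      const std::vector<std::size_t>& live) {
    std::vector<Word> out(acc.size(), 0);
    for (std::size_t w : live)  // only words that can still contain 1's
        out[w] = acc[w] & c[w];
    return out;
}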

Dense datasets. If the dataset is dense, we expect to deal with strong correlations among the most frequent items. This not only means that the tidlists associated with these most frequent items contain long runs of 1's, but also that they turn out to be very similar to one another. The heuristic technique adopted by DCI, and consequently by kDCI, for dense datasets thus works as follows:

  – we reorder the columns of the vertical dataset, moving identical segments of the tidlists associated with the most frequent items to the first consecutive positions;

  – since each candidate is likely to include several of these most frequent items, we avoid repeated intersections of identical segments (a simplified sketch follows the list).
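
A simplified sketch of this technique follows (again our own illustration under stated assumptions, not the actual kDCI code): if, after the column reordering, the tidlists of the items in the dense group are identical over their first seg_len words, ANDing that segment more than once is a no-op and can be skipped:

#include <cstdint>
#include <vector>

using Word = std::uint64_t;

// Intersect the tidlists of a candidate's items. Tidlists of items in the
// dense group are assumed identical over their first seg_len words after
// the column reordering.
std::vector<Word> intersect_candidate(
        const std::vector<std::vector<Word>>& tidlist,   // one per item
        const std::vector<std::size_t>& cand,            // item ids of candidate
        const std::vector<bool>& in_dense_group,         // per-item flag
        std::size_t seg_len) {
    std::vector<Word> out = tidlist[cand[0]];            // start from first item
    for (std::size_t i = 1; i < cand.size(); ++i) {
        const std::vector<Word>& t = tidlist[cand[i]];
        // If both tidlists share the identical leading segment, ANDing it
        // again cannot change the result: skip straight past it.
        std::size_t start =
            (in_dense_group[cand[0]] && in_dense_group[cand[i]]) ? seg_len : 0;
        for (std::size_t w = start; w < out.size(); ++w) out[w] &= t[w];
    }
    return out;
}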

The heuristic for density evaluation is applied only once, as soon as the vertical dataset is built. After this decision is taken, we further check whether the counting inference strategy (see Section 2.4) can be profitable or not. The effectiveness of the inference strategy depends on the ratio between the total number of frequent itemsets and how many of them are key patterns: the closer this ratio is to 1, the less advantage the inference strategy introduces. Since this ratio is not known until the computation is finished, we found that the same information can be derived from the average support of the frequent singletons (items) after the first scan. The idea behind this is that if the average support of the single items that survived the first scan is high enough, then longer patterns can be expected to be frequent, and the number of key patterns is likely to be lower than the number of frequent itemsets. We experimentally verified that this simple heuristic gives the correct output for all the datasets we tested, both real and synthetic.

To summarize the rationale behind the kDCI multiple-strategy approach: if the key-pattern optimization can be adopted, we use the counting inference method, which allows many intersections to be avoided. For the intersections that cannot be avoided, and in the cases where the key-pattern inference method cannot be applied, we further distinguish between sparse and dense datasets, and apply the two strategies explained above.

2.4   Pattern Counting Inference

In this section we describe the counting inference method, which constitutes the most important innovation introduced in kDCI. We exploit a technique inspired by the theoretical results presented in [3], where the PASCAL algorithm was introduced. PASCAL is able to infer the support of an itemset without actually counting its occurrences in the database. In this seminal work, the authors introduced the concept of key pattern (or key itemset). Given a generic pattern Q, it is possible to determine an equivalence class [Q], which contains the set of patterns that have the same support and are included in the same set of database transactions. Moreover, if we define min[P] as the set of the smallest itemsets in [P], a pattern P is a key pattern if P ∈ min[P], i.e. no proper subset of P is in the same equivalence class. Note that we can have several key patterns for each equivalence class. Figure 4 shows an example of a lattice of frequent itemsets, taken from [3], where equivalence classes and key patterns are highlighted.

[Figure 4. Example lattice of frequent items, with equivalence classes and key patterns highlighted (taken from [3]).]

Given an equivalence class [P], we can also define a corresponding closed set [12]: the closed set c of [P] is equal to max[P], so that no proper superset of c can belong to the same equivalence class [P].

Among the results illustrated in [3] we have the following important theorems:

Theorem 1 Q is a key pattern iff supp(Q) ≠ min_{p∈Q}(supp(Q \ {p})).

Theorem 2 If P is not a key pattern, and P ⊆ Q, then Q is a non-key pattern as well.

From Theorem 1 it is straightforward to observe that if Q is a non-key pattern, then:
    supp(Q) = min_{p∈Q}(supp(Q \ {p})).          (1)

Moreover, Theorem 1 says that we can check whether Q is a key pattern by comparing its support with the minimum support of its proper subsets, i.e. min_{p∈Q}(supp(Q \ {p})). We will show in the following how to use this property to make candidate support counting faster.

Theorems 1 and 2 give the theoretical foundations for the PASCAL algorithm, which finds the support of a non-key k-candidate Q by simply searching for the minimum support among all its (k-1)-subsets. Note that such a search can be performed during the pruning phase of the Apriori candidate generation. DCI does not perform candidate pruning, because its intersection technique is comparably faster. For this reason we do not adopt the PASCAL counting inference in kDCI.

The following theorem, partially inspired by the proof of Theorem 2, suggests a faster way to compute the support of a non-key k-candidate Q.

Before introducing the theorem, we need to define the function f, which assigns to each pattern P the set of all the transactions that include this pattern. We can define the support of a pattern in terms of f: supp(P) = |f(P)|. Note that f is a monotonically decreasing function, i.e. P1 ⊆ P2 ⇒ f(P2) ⊆ f(P1). This is obvious, because every transaction containing P2 surely contains all the subsets of P2.

Theorem 3 If P is a non-key pattern and P ⊆ Q, the following holds:

    f(Q) = f(Q \ (P \ P')),

where P' ⊂ P, and P and P' belong to the same equivalence class, i.e. P, P' ∈ [P].

PROOF. Note that, since P is a non-key pattern, it is surely possible to find a pattern P', P' ⊂ P, belonging to the same equivalence class [P].

In order to prove the Theorem, we first show that f(Q) ⊆ f(Q \ (P \ P')), and then that f(Q) ⊇ f(Q \ (P \ P')) also holds.

The first assertion, f(Q) ⊆ f(Q \ (P \ P')), holds because (Q \ (P \ P')) ⊆ Q, and f is a monotonically decreasing function.

To prove the second assertion, f(Q) ⊇ f(Q \ (P \ P')), we can rewrite f(Q) as f(Q \ (P \ P') ∪ (P \ P')), which is equivalent to f(Q \ (P \ P')) ∩ f(P \ P'). Since f is decreasing, f(P) ⊆ f(P \ P'). But, since P, P' ∈ [P], we can write f(P) = f(P') ⊆ f(P \ P'). Therefore f(Q) = f(Q \ (P \ P')) ∩ f(P \ P') ⊇ f(Q \ (P \ P')) ∩ f(P'). The last inequality is equivalent to f(Q) ⊇ f(Q \ (P \ P') ∪ P'). Since P' ⊆ (Q \ (P \ P')) clearly holds, it follows that f(Q \ (P \ P') ∪ P') = f(Q \ (P \ P')). So we can conclude that f(Q) ⊇ f(Q \ (P \ P')), which completes the proof. □

The following corollary is trivial, since we defined supp(Q) = |f(Q)|.

Corollary 1 If P is a non-key pattern, and P ⊆ Q, the support of Q can be computed as follows:

    supp(Q) = supp(Q \ (P \ P')),

where P and P', P' ⊂ P, belong to the same equivalence class, i.e. P, P' ∈ [P].

Finally, we can introduce Corollary 2, which is a particular case of the previous one.

Corollary 2 If Q is a k-candidate (i.e. Q ∈ Ck) and P, P ⊂ Q, is a frequent non-key (k-1)-pattern (i.e. P ∈ Fk−1), there must exist P' ∈ Fk−2, P' ⊂ P, such that P and P' belong to the same equivalence class, i.e. P, P' ∈ [P], and P and P' differ by a single item: {pdiff} = P \ P'. The support of Q can thus be computed as:

    supp(Q) = supp(Q \ (P \ P')) = supp(Q \ {pdiff}).

Corollary 2 says that to find the support of a non-key candidate pattern Q, we can simply check whether Q \ {pdiff} belongs to Fk−1 or not. If Q \ {pdiff} ∈ Fk−1, then Q inherits the same support as Q \ {pdiff} and is therefore frequent. Otherwise, Q \ {pdiff} is not frequent, and therefore neither is Q.

Using the theoretical result of Corollary 2, we adopted the following strategy in order to determine the support of a candidate Q at step k. In kDCI, we store with each itemset P ∈ Fk−1 the following information:

  • supp(P);

  • a flag indicating whether P is a key pattern or not;

  • if P is a non-key pattern, also the item pdiff such that P \ {pdiff} = P' ∈ [P].

Note that pdiff must be one of the items that we can remove from P to obtain a proper subset P' of P belonging to the same equivalence class.
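
To make this bookkeeping concrete, the following C++ sketch (our reconstruction from the description in this section; names and containers are hypothetical, and kDCI's actual directly-accessible structures differ) stores the per-pattern triple and infers a candidate's support with a single lookup, falling back to counting plus the Theorem 1 test described in the paragraphs below:

#include <cstdint>
#include <map>
#include <optional>
#include <set>

using Item = std::uint16_t;
using Itemset = std::set<Item>;  // kept sorted, convenient for set algebra

struct PatternInfo {
    std::uint32_t supp;   // supp(P)
    bool is_key;          // key-pattern flag (Theorem 1)
    Item pdiff;           // valid only if !is_key: P \ {pdiff} ∈ [P]
};

// F(k-1), indexed by itemset so candidates can be looked up directly.
using Frequents = std::map<Itemset, PatternInfo>;

// Try to infer supp(Q) from one non-key generator subset P (Corollary 2).
// Returns the support if Q \ {pdiff} ∈ F(k-1); std::nullopt means that
// Q \ {pdiff}, a (k-1)-subset of Q, is infrequent, hence so is Q.
std::optional<std::uint32_t> infer_support(const Itemset& Q,
                                           const PatternInfo& P_info,
                                           const Frequents& Fk_minus_1) {
    Itemset probe(Q);
    probe.erase(P_info.pdiff);           // Q \ {pdiff}
    auto it = Fk_minus_1.find(probe);    // single search in F(k-1)
    if (it == Fk_minus_1.end()) return std::nullopt;
    return it->second.supp;              // supp(Q) = supp(Q \ {pdiff})
}

// Worst case: all subsets of Q are key patterns. After counting supp(Q)
// the usual way, apply Theorem 1 to classify Q and record its pdiff.
PatternInfo classify(const Itemset& Q, std::uint32_t counted_supp,
                     const Frequents& Fk_minus_1) {
    std::uint32_t min_supp = UINT32_MAX;
    Item arg_min = 0;
    for (Item p : Q) {                   // scan the (k-1)-subsets Q \ {p}
        Itemset sub(Q);
        sub.erase(p);
        auto it = Fk_minus_1.find(sub);  // present for every frequent Q
        if (it != Fk_minus_1.end() && it->second.supp < min_supp) {
            min_supp = it->second.supp;
            arg_min = p;
        }
    }
    const bool is_key = (counted_supp != min_supp);  // Theorem 1
    return PatternInfo{counted_supp, is_key, arg_min};
}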
[Figure 5. Total execution time of OP, FP, Eclatd, and kDCI on the connect, chess, pumsb and pumsb_star datasets as a function of the support threshold (total execution time in seconds, log scale).]
During the generation of a generic candidate Q ∈ Ck, as soon as kDCI discovers that one of the subsets of Q, say P, is a non-key pattern, kDCI searches Fk−1 for the pattern Q \ {pdiff}, where pdiff is the item stored with P.

If Q \ {pdiff} is found, then Q is a frequent non-key pattern (see Theorem 2), its support is supp(Q \ {pdiff}), and the item to store with Q is exactly pdiff. In fact, Q' = Q \ {pdiff} ∈ [Q], i.e. pdiff is one of the items that we can remove from Q to obtain a subset Q' belonging to the same equivalence class.

The worst case is when all the subsets of Q in Fk−1 are key patterns and the support of Q cannot be inferred from its subsets. In this case kDCI counts the support of Q as usual, and applies Theorem 1 to determine whether Q is a non-key pattern. If Q is a non-key pattern, its support becomes supp(Q) = min_{p∈Q}(supp(Q \ {p})) (see Theorem 1), while the item to be stored with Q is pdiff, i.e. the item to be subtracted from Q to obtain the subset with the minimum support.

The impact of this counting inference technique on the performance of an FSC algorithm becomes evident if one considers the Apriori-like candidate generation strategy adopted by kDCI. From the combination of every pair of itemsets Pa and Pb ∈ Fk−1 that share the same (k-2)-prefix (we call them generators), kDCI generates a candidate k-itemset Q. For very dense datasets, most of the frequent patterns belonging to Fk−1 are non-key patterns. Therefore one or both of the patterns Pa and Pb used to generate Q ∈ Ck are likely to be non-key patterns. In such cases, in order to find a non-key pattern and then apply Corollary 2, it is not necessary to check the existence of further subsets of Q. For most of the candidates, a single binary search in Fk−1, to look for the pattern Q \ {pdiff}, is thus sufficient to compute supp(Q). Moreover, Q \ {pdiff} is often exactly equal to one of the two (k-1)-itemsets belonging to the generating pair (Pa, Pb): in this case kDCI does not need to perform any search at all to compute supp(Q).

We conclude this section with some examples of how the counting inference technique works. Let us consider Figure 4. Itemset Q = {A, B, E} is a non-key pattern because P = {B, E} is a non-key pattern as well. So, if P' = {B}, kDCI will store pdiff = E with P. We have that supp({A, B, E}) = supp({A, B, E} \ ({B, E} \ {B})) = supp({A, B, E} \ {pdiff}) = supp({A, B}). From the Figure you can see that {A, B, E} and {A, B} both belong to the same equivalence class.
[Figure 6. Total execution time of OP, FP, Eclatd, and kDCI on the mushroom, BMS_View_1, T25I10D10K and T30I16D400K datasets as a function of the support threshold (total execution time in seconds, log scale).]
Another example is itemset Q = {A, B, C, E}, which is generated by the two non-key patterns {A, B, C} and {A, B, E}. Suppose that P = {A, B, C}, i.e. the first generator, while P′ = {A, B}. In this case kDCI will store pdiff = C with P. We have that supp({A, B, C, E}) = supp({A, B, C, E} \ ({A, B, C} \ {A, B})) = supp({A, B, C, E} \ {pdiff}) = supp({A, B, E}), where {A, B, E} is exactly the second generator. In this case, no search is necessary to find {A, B, E}. Looking at the figure, it is possible to verify that {A, B, C, E} and {A, B, E} both belong to the same equivalence class.
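A minimal sketch of the inference step illustrated by the two examples above: assuming each non-key pattern in Fk−1 is stored together with its pdiff item and that supports are kept in a hash map (the dictionaries, the function name, and the toy support values are our assumptions, not kDCI's actual structures), the support of Q reduces to a single lookup:

```python
# Counting inference: when candidate Q is generated from a non-key
# pattern P stored with its item pdiff, supp(Q) = supp(Q \ {pdiff}),
# i.e. a plain lookup among the supports computed at level k-1.
support = {
    ('B',): 4, ('B', 'E'): 4,      # {B,E} is non-key: supp = supp({B})
    ('A', 'B'): 3,
}
pdiff_of = {('B', 'E'): 'E'}       # pdiff stored with each non-key pattern

def infer_support(Q, P):
    """Return supp(Q) by inference, or None if generator P is a key pattern."""
    pdiff = pdiff_of.get(P)
    if pdiff is None:
        return None                # P is a key pattern: Q must be counted
    reduced = tuple(i for i in Q if i != pdiff)   # Q \ {pdiff}
    return support[reduced]        # no tidlist intersection needed

# First example: Q = {A,B,E} with generator P = {B,E} and pdiff = E,
# so supp(Q) = supp({A,B}) = 3.
print(infer_support(('A', 'B', 'E'), ('B', 'E')))
```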
3    Experimental Results

   We experimentally evaluated the performance of kDCI by comparing its execution time with that of the original implementations of state-of-the-art FSC algorithms, namely FP-growth (FP) [6], Opportunistic Projection (OP) [7], and Eclat with diffsets (Eclatd) [14], provided by their respective authors.

   We used an MS-Windows XP workstation equipped with a Pentium IV 1800 MHz processor, 368MB of RAM, and an EIDE hard disk. For the tests, we used both synthetic and real-world datasets. All the synthetic datasets were created with the IBM dataset generator [1], while all the real-world datasets but one were downloaded from the UCI KDD Archive (http://kdd.ics.uci.edu/). We also extracted a real-world dataset from the TREC WT10g corpus [2]. The original corpus contained about 1.69 million Web documents. The dataset for our tests was built by considering the set of all the terms contained in each document as a transaction. Before generating the transactional dataset, the collection of documents was filtered by removing HTML tags and the most common words (stopwords), and by applying a stemming algorithm. The resulting trec dataset is huge: it is about 1.3GB and contains 1.69 million short and long transactions, where the maximum length of a transaction is 71,473 items.
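The paper does not detail the filtering pipeline, but a minimal sketch of the transformation it describes, HTML-tag removal, stopword filtering, stemming, and one set-of-terms transaction per document, could look as follows. The regex, the tiny stopword list, and the crude suffix-stripping stand-in for a real stemmer (e.g. Porter's algorithm) are all our placeholders:

```python
# Build one transaction (a set of stemmed terms) from an HTML document.
import re

STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is'}  # tiny sample

def crude_stem(term):
    # stand-in for a real stemming algorithm
    for suffix in ('ing', 'ed', 's'):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def document_to_transaction(html):
    text = re.sub(r'<[^>]+>', ' ', html).lower()   # remove HTML tags
    terms = re.findall(r'[a-z]+', text)
    return {crude_stem(t) for t in terms if t not in STOPWORDS}

print(sorted(document_to_transaction('<p>Mining frequent itemsets in the Web</p>')))
# ['frequent', 'itemset', 'min', 'web']
```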
kDCI performance and comparisons. Figures 5 and 6 report the total execution time obtained by running FP, Eclatd, OP, and kDCI on various datasets as a function of the support threshold s. On all the datasets in Figure 5 (connect, chess, pumsb, and pumsb_star), kDCI runs faster than the other algorithms. On pumsb its execution time is very similar to that of OP. For high support thresholds kDCI can drastically prune the dataset and build a compact vertical dataset whose tidlists present large similarities. Such similarity among tidlists is effectively exploited by our strategy for compact datasets. For smaller supports, the benefits introduced by the counting inference strategy become more evident, particularly for the pumsb_star and connect datasets. In these cases the number of frequent itemsets is much higher than the number of key patterns, thus allowing kDCI to drastically reduce the number of intersections needed to determine candidate supports.

   On the datasets mushroom and T30I16D400K (see Figure 6), kDCI outperforms the other competitors, and this also holds for the real-world dataset BMS_View_1 when mined with very small support thresholds (see Figure 6). On only one dataset, namely T25I10D10K, are FP and OP faster than kDCI for all the supports. The reason for this behavior is the size of the candidate set C3, which for this dataset is much larger than F3. While kDCI has to carry out a lot of useless work to determine the support of many candidate itemsets that are not frequent, FP-growth and OP take advantage of the fact that they do not require candidate generation.

   Furthermore, differently from FP, Eclatd, and OP, kDCI can efficiently mine huge datasets such as trec and USCensus1990. Figure 7 reports the total execution time required by kDCI to mine these datasets with different support thresholds. The other algorithms failed to mine these datasets due to memory shortage, even when very large support thresholds were used. On the other hand, kDCI was able to mine such huge datasets because it adapts its behavior to both the size of the dataset and the main memory available.

[Figure 7: two log-scale plots of Total Execution Time (sec) versus Support (%) for kDCI on the datasets trec and USCensus1990.]

   Figure 7. Total execution time of kDCI on the datasets trec (a) and USCensus1990 (b) as a function of the support threshold.
4    Conclusions and Future Work

   Due to the complexity of the problem, a good algorithm for FSC has to implement multiple strategies and some level of adaptiveness in order to successfully manage diverse and differently featured inputs.

   kDCI uses different approaches for extracting frequent patterns: count-based during the first iterations and intersection-based for the following ones.

   Moreover, the new counting inference strategy, adaptiveness, and resource awareness are the main innovative features of the algorithm.

   On the basis of the characteristics of the mined dataset, kDCI chooses which optimization to adopt for reducing the cost of mining at run-time. Dataset pruning and effective out-of-core techniques are exploited during the count-based phase, while the intersection-based phase, which starts only when the pruned dataset can fit into the main memory, exploits a novel technique based on the notion of key patterns that in many cases allows the support of an itemset to be inferred without any counting.

   kDCI also adopts compressed data structures and dynamic type selection to adapt itself to the characteristics of the dataset being mined.
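As an illustration of what dynamic type selection can look like, the following sketch picks the smallest unsigned integer type able to represent item identifiers and support counters, given the number of distinct frequent items and of transactions that survive pruning. The thresholds mirror common unsigned integer widths; the names and choices are our assumptions, not kDCI's actual ones:

```python
# Pick the narrowest unsigned integer type for item ids and counters,
# so that itemset collections and counters stay as compact as possible.
import numpy as np

def smallest_uint(max_value):
    for dtype in (np.uint8, np.uint16, np.uint32):
        if max_value <= np.iinfo(dtype).max:
            return dtype
    return np.uint64

n_frequent_items = 1_200       # e.g. distinct frequent items after pruning
n_transactions   = 400_000     # e.g. T30I16D400K

item_t  = smallest_uint(n_frequent_items - 1)   # item ids fit in uint16
count_t = smallest_uint(n_transactions)         # counters need uint32

itemset = np.zeros(3, dtype=item_t)             # a 3-itemset, stored compactly
print(item_t, count_t)
```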
   The experimental evaluation demonstrated that kDCI outperforms FP, OP, and Eclatd in most cases. Moreover, differently from the other FSC algorithms tested, kDCI can efficiently manage very large datasets, also on machines with limited physical memory.

   Although the variety of datasets used and the large number of tests conducted permit us to state that the performance of kDCI is not highly influenced by dataset characteristics, and that our optimizations are very effective and general, further optimizations and future work can reasonably be expected to improve kDCI's performance. More optimized data structures could be used to store itemset collections, in order to make searches in such collections faster. Note that such fast searches are very important in kDCI, which bases its counting inference technique at level k on searching for frequent itemsets in Fk−1. Finally, we could benefit from a higher level of adaptiveness to the memory available on the machine, either with fully memory-mapped data structures or with out-of-core ones, depending on the data size. This would allow better scalability and wider applicability of the algorithm.
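A toy sketch of the kind of memory-driven choice envisaged here, selecting among in-core, memory-mapped, and out-of-core representations according to the size of the (pruned) dataset relative to the available memory. The thresholds and names are arbitrary placeholders, not a description of kDCI's actual policy:

```python
# Choose a dataset representation based on its size versus available RAM.
def choose_representation(dataset_bytes, available_bytes):
    if dataset_bytes < 0.5 * available_bytes:
        return 'in-core'        # vertical dataset fits comfortably in RAM
    if dataset_bytes < 2.0 * available_bytes:
        return 'memory-mapped'  # let the OS page tidlists in and out
    return 'out-of-core'        # explicit disk-based processing

# trec (about 1.3GB) on a 368MB machine: 'out-of-core'
print(choose_representation(1_300_000_000, 368 * 1024**2))
```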
5    Acknowledgments

   We acknowledge J. Han, Y. Pan, M. J. Zaki, and J. Liu for kindly providing us with the latest versions of their FSC software.
References

 [1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th VLDB Conf., pages 487–499, 1994.
 [2] P. Bailey, N. Craswell, and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing and Management, to appear.
 [3] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. ACM SIGKDD Explorations Newsletter, 2(2):66–75, December 2000.
 [4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the 1997 ACM SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, USA. ACM Press, 1997.
 [5] B. Goethals. Efficient Frequent Itemset Mining. PhD thesis, Limburg University, Belgium, 2003.
 [6] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1–12, Dallas, Texas, USA, 2000.
 [7] J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. In Proc. of the 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, 2002.
 [8] S. Orlando, P. Palmerini, and R. Perego. Enhancing the Apriori Algorithm for Frequent Set Counting. In Proc. of the 3rd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 01), Munich, Germany, volume 2114 of LNCS, pages 71–82. Springer, 2001.
 [9] S. Orlando, P. Palmerini, and R. Perego. On Statistical Properties of Transactional Datasets. In Proc. of the 2004 ACM Symposium on Applied Computing (SAC 2004), 2004. To appear.
[10] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and Resource-Aware Mining of Frequent Sets. In Proc. of the 2002 IEEE International Conference on Data Mining (ICDM'02), pages 338–345, 2002.
[11] J. S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data, pages 175–186, 1995.
[12] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science, 1540:398–416, 1999.
[13] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In Proc. of the 2001 IEEE International Conference on Data Mining (ICDM'01), San Jose, CA, USA, 2001.
[14] M. J. Zaki and K. Gouda. Fast Vertical Mining Using Diffsets. In Proc. of the 9th Int. Conf. on Knowledge Discovery and Data Mining, Washington, DC, 2003.