=Paper= {{Paper |id=Vol-126/paper-11 |storemode=property |title=kDCI: on using direct count up to the third iteration |pdfUrl=https://ceur-ws.org/Vol-126/kdci.pdf |volume=Vol-126 |dblpUrl=https://dblp.org/rec/conf/fimi/LuccheseOP04a }} ==kDCI: on using direct count up to the third iteration== https://ceur-ws.org/Vol-126/kdci.pdf
              kDCI: on using direct count up to the third iteration

            Claudio Lucchese                    Salvatore Orlando                                 Raffaele Perego
         HPC Lab of ISTI-CNR                   Università Ca’ Foscari                        HPC Lab of ISTI-CNR
               Pisa, Italy.                       Venezia, Italy.                                   Pisa, Italy.
       claudio.lucchese@isti.cnr.it            orlando@dsi.unive.it                         raffaere.perego@isti.cnr.it


1. Problem Statement and Solution                                                                Time spent for each itaration of kDCI
                                                                               12
                                                                                                                                 kDCI
                                                                                                                                 kDCI with 3rd iteration direct count
   In Apriori-like algorithms, one of the most consum-
ing operation during the frequent itemset mining pro-                          10


cess is the candidate search. At each iteration k the
whole dataset D has to be scanned, and for each trans-                          8


action t in the database, every of its subsets of length


                                                                 time (sec.)
k is generated and searched within the candidates. If                           6


a candidate is matched, it means that the transaction
subsumes the candidate, and therefore its support can                           4

be incremented by one.
   This search is very time demanding even if appropri-                         2

ate data structures are used to gain a logarithmic cost.
In [3, 2] we introduced a direct count technique which                          0
                                                                                    0   2    4             6               8             10           12                14
allows constant time searches for candidates of length                                                         iteration


2. Given the set of n frequent single items, candidates         Figure 1. The dataset used was T10I4D100K
of length 2 are stored using an upper triangular matrix         with a minimum absolute support of 10.
n × n DC2 with ( n2 ) cells, such that DC2 (i, j) stored
the support of the 2-itemset {ij}. As shown in [1] the
direct count procedure can be extended to the third it-      the time spent during the firsts iteration is significant
                                                             for the global effectiveness of the algorithm on most
eration using an n × n × n matrix DC3 with ( n3 ) cells,
                                                             datasets. In fact, in the test performed, the total time
where DC3 (i, j, l) is the support of the 3-itemset {ijl}.   was reduced from 19 sec. to 14 sec., which means an
   We thus introduced such technique in the last ver-        overall speed up of about 20%.
sion of kDCI, which is level-wise hybrid algorithm.             We acknowledge the authors C.Targa, A.Prado and
kDCI stores the dataset with an horizontal format            A.Plastino of [1], who showed the effectiveness of such
to disk during the first iterations. After some itera-       optimization.
tion the dataset may become small enough (thanks to
anti-monotone frequency pruning) to be stored in the         References
main memory in a vertical format, and after that the
algorithm goes on performing tid-lists intersections to      [1] C.Targa, A.Prado, and A.Plastino. Improving direct
retrieve itemsets supports, and searches among can-              counting for frequent set mining. Technical report, In-
didates are not needed anymore. Usually the dataset              stituto de Computação, UFF RT-02/09 2003.
happens to be small enough at most at the fourth iter-       [2] Claudio Lucchese, Salvatore Orlando, Paolo Palmerini,
ation.                                                           Raffaele Perego, and Fabrizio Silvestri. kdci: a multi-
                                                                 strategy algorithm for mining frequent sets. In Proceed-
                                                                 ings of the IEEE ICDM Workshop on Frequent Itemset
2. Experiments and Conclusion                                    Mining Implementations, November 2003.
                                                             [3] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri.
   The experiments show that the improvement given               Adaptive and resource-aware mining of frequent sets. In
by this optimization is sensible in some cases. The              Proc. The 2002 IEEE International Conference on Data
time needed for the third iteration is halved. Note that         Mining (ICDM02), page 338345, 2002.