=Paper=
{{Paper
|id=Vol-126/paper-11
|storemode=property
|title=kDCI: on using direct count up to the third iteration
|pdfUrl=https://ceur-ws.org/Vol-126/kdci.pdf
|volume=Vol-126
|dblpUrl=https://dblp.org/rec/conf/fimi/LuccheseOP04a
}}
==kDCI: on using direct count up to the third iteration==
kDCI: on using direct count up to the third iteration
Claudio Lucchese Salvatore Orlando Raffaele Perego
HPC Lab of ISTI-CNR Università Ca’ Foscari HPC Lab of ISTI-CNR
Pisa, Italy. Venezia, Italy. Pisa, Italy.
claudio.lucchese@isti.cnr.it orlando@dsi.unive.it raffaere.perego@isti.cnr.it
1. Problem Statement and Solution Time spent for each itaration of kDCI
12
kDCI
kDCI with 3rd iteration direct count
In Apriori-like algorithms, one of the most consum-
ing operation during the frequent itemset mining pro- 10
cess is the candidate search. At each iteration k the
whole dataset D has to be scanned, and for each trans- 8
action t in the database, every of its subsets of length
time (sec.)
k is generated and searched within the candidates. If 6
a candidate is matched, it means that the transaction
subsumes the candidate, and therefore its support can 4
be incremented by one.
This search is very time demanding even if appropri- 2
ate data structures are used to gain a logarithmic cost.
In [3, 2] we introduced a direct count technique which 0
0 2 4 6 8 10 12 14
allows constant time searches for candidates of length iteration
2. Given the set of n frequent single items, candidates Figure 1. The dataset used was T10I4D100K
of length 2 are stored using an upper triangular matrix with a minimum absolute support of 10.
n × n DC2 with ( n2 ) cells, such that DC2 (i, j) stored
the support of the 2-itemset {ij}. As shown in [1] the
direct count procedure can be extended to the third it- the time spent during the firsts iteration is significant
for the global effectiveness of the algorithm on most
eration using an n × n × n matrix DC3 with ( n3 ) cells,
datasets. In fact, in the test performed, the total time
where DC3 (i, j, l) is the support of the 3-itemset {ijl}. was reduced from 19 sec. to 14 sec., which means an
We thus introduced such technique in the last ver- overall speed up of about 20%.
sion of kDCI, which is level-wise hybrid algorithm. We acknowledge the authors C.Targa, A.Prado and
kDCI stores the dataset with an horizontal format A.Plastino of [1], who showed the effectiveness of such
to disk during the first iterations. After some itera- optimization.
tion the dataset may become small enough (thanks to
anti-monotone frequency pruning) to be stored in the References
main memory in a vertical format, and after that the
algorithm goes on performing tid-lists intersections to [1] C.Targa, A.Prado, and A.Plastino. Improving direct
retrieve itemsets supports, and searches among can- counting for frequent set mining. Technical report, In-
didates are not needed anymore. Usually the dataset stituto de Computação, UFF RT-02/09 2003.
happens to be small enough at most at the fourth iter- [2] Claudio Lucchese, Salvatore Orlando, Paolo Palmerini,
ation. Raffaele Perego, and Fabrizio Silvestri. kdci: a multi-
strategy algorithm for mining frequent sets. In Proceed-
ings of the IEEE ICDM Workshop on Frequent Itemset
2. Experiments and Conclusion Mining Implementations, November 2003.
[3] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri.
The experiments show that the improvement given Adaptive and resource-aware mining of frequent sets. In
by this optimization is sensible in some cases. The Proc. The 2002 IEEE International Conference on Data
time needed for the third iteration is halved. Note that Mining (ICDM02), page 338345, 2002.