=Paper= {{Paper |id=Vol-126/paper-9 |storemode=property |title=AIM2: Improved implementation of AIM |pdfUrl=https://ceur-ws.org/Vol-126/shporer.pdf |volume=Vol-126 |dblpUrl=https://dblp.org/rec/conf/fimi/Shporer04 }} ==AIM2: Improved implementation of AIM== https://ceur-ws.org/Vol-126/shporer.pdf

AIM2: Improved implementation of AIM

Sagi Shporer
School of Computer Science
Tel-Aviv University
Tel Aviv, Israel
shporer@tau.ac.il

Abstract sets [9], Dynamic Reordering [5], Parent Equivalence
Pruning [3, 8] and Bit-vector projection [3].
We present AIM2-F, an improved implementation High level pseudo code for the AIM-F algorithm ap-
of AIM-F [4] algorithm for mining frequent itemsets. pears in Figure 1.
Past studies have proposed various algorithms and tech-
niques for improving the efficiency of the mining task. AIM-F(n : node, minsupport : integer)
We have presented AIM-F at FIMI’03, a combina- (1) t = n.tail
tion of some techniques into an algorithm which uti- (2) for each α in t S
lize those techniques dynamically according to the in- (3) Compute sα = support(n.head α)
put dataset. The algorithm main features include depth (4) if (sα = support(n.head))
first search with vertical compressed database, diffset, (5) add α to the list of items removed by PEP
parent equivalence pruning, dynamic reordering and (6) remove α from t
projection. Experimental testing suggests that AIM2- (7) else if (sα < minsupport)
F outperforms existing algorithm implementations on (8) remove α from t
various datasets. (9) Sort items in t by sα in ascending order.
(10) While t 6= ∅
(11) Let α be the first item in t
(12) remove α from t S
1. Introduction (13) n0 .head = n.head α
(14) n0 .tail = t S
Finding association rules is one of the driving ap- (15) Report n0 .head {All subsets of items
plications in data mining, and much research has been removed by PEP} as frequent itemsets
done in this field [7, 3, 5]. Using the support-confidence (16) AIM-F(n0 )
framework, proposed in the seminal paper of [1], the
problem is split into two parts — (a) finding frequent
itemsets, and (b) generating association rules.
Let I be a set of items. A subset X ⊆ I is called Figure 1. AIM-F
an itemset. Let D be a transactional database, where
each transaction T ∈ D is a subset of I : T ⊆ I. For an
itemset X, support(X) is defined to be the number of 2. Implementation Improvements
transactions T for which X ⊆ T . For a given parameter
minsupport, an itemset X is call a frequent itemset We now describe the difference between AIM-F and
if support(X) ≥ minsupport. The set of all frequent AIM2-F implementations:
itemsets is denoted by F.
• Integer to String conversions - Experiments run
We have presented AIM-F [4] for mining frequent
time analysis have shown that the conversion of
itemsets. The AIM-F algorithm build upon several
integers to strings is a major CPU consumer. To
ideas appearing in previous work, a partial list of which
reduce conversion time two steps are taken:
is the following: Apriori [2], Lexicographic Trees and
Depth First Search Traversal [6], Dynamic Reordering – Item name conversion - When printing an
[5], Vertical Bit Vectors [7, 3], Projection [3], Difference itemset all the item names in the itemset are
Figure 2. Connect dataset: Testing AIM2-F Figure 3. Chess dataset: Testing AIM2-F with
with and without the string conversion im- and without the string conversion improve-
provements ments

printed. In this mining task the items are minSupport. This enables the construction of the
numbers, and need to be converted to strings. F 2 for larger datasets.
Instead of creating the string every time be- • Input buffer reuse - In AIM-F the dataset load
fore printing, the conversion is done once for method allocated an input buffer for every trans-
every item, when the item is loaded during action read. Switching to a single input buffer
the dataset reading process. that is re-used for all the transactions reduced the
– Support conversion - To print the support it loading time in AIM2-F by nearly 50%. However
must be converted to a string. To enable fast the loading time is usually insignificant comparing
conversion of the support value to string, a to the overall runtime (unless the support is very
static lookup table from integer to string was high).
added. The lookup table contains the 64K
integer values above the minSupport. Every References
entry in the lookup table has the string repre-
sentation of the entry attached. Every time a [1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining as-
support value needs to be converted to string, sociation rules between sets of items in large databases.
it is first checked if the value appears in the In SIGMOD, pages 207–216, 1993.
lookup table, if so, the string is taken from [2] R. Agrawal and R. Srikant. Fast algorithms for mining
the table, with a very low cost. association rules. In VLDB, pages 487–499, 1994.
[3] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: a
maximal frequent itemset algorithm for transactional
In figures 2 and 3 we compare the AIM2-F al-
databases. In ICDE, 2001.
gorithm runtime with and without the string con- [4] A. Fiat and S. Shporer. Aim: Another itemset miner.
version improvement. It is clear that this improve- In FIMI, 2003.
ment alone contribute up to an order of magnitude [5] R. J. B. Jr. Efficiently mining long patterns from
improvement. As the size of the input increases databases. In SIGMOD, pages 85–93, 1998.
(lower support) the contribution of the string con- [6] R. Rymon. Search through systematic set enumeration.
version improvements increases. In KR-92, pages 539–550, 1992.
[7] P. Shenoy, J. R. Haritsa, S. Sundarshan, G. Bhalotia,
• Late F 2 matrix construction - The size of the M. Bawa, and D. Shah. Turbo-charging vertical mining
F 2 matrix is I 2 where I is the number of items. of large databases. In SIGMOD, 2000.
[8] M. J. Zaki. Scalable algorithms for association mining.
In datasets where the number of items is very
Knowledge and Data Engineering, 12(2):372–390, 2000.
large the F 2 matrix can not be constructed. The [9] M. J. Zaki and K. Gouda. Fast vertical mining using
improvement in AIM2-F is that the F 2 matrix diffsets. In KDD, pages 326–335, 2003.
is built only for items for which support(i) ≥