
AIM2: Improved implementation of AIM

Sagi Shporer, School of Computer Science, Tel-Aviv University, Tel Aviv, Israel

We present AIM2-F, an improved implementation of the AIM-F [4] algorithm for mining frequent itemsets. Past studies have proposed various algorithms and techniques for improving the efficiency of the mining task. At FIMI'03 we presented AIM-F, an algorithm that combines several of these techniques and applies them dynamically according to the input dataset. The algorithm's main features include depth-first search with a vertical compressed database, diffsets, parent equivalence pruning, dynamic reordering, and projection. Experimental testing suggests that AIM2-F outperforms existing algorithm implementations on various datasets.

1. Introduction

Finding association rules is one of the driving applications in data mining, and much research has been done in this field [7, 3, 5]. Using the support-confidence framework proposed in the seminal paper of [1], the problem is split into two parts: (a) finding frequent itemsets, and (b) generating association rules.

Let I be a set of items. A subset X ⊆ I is called an itemset. Let D be a transactional database, where each transaction T ∈ D is a subset of I: T ⊆ I. For an itemset X, support(X) is defined to be the number of transactions T for which X ⊆ T. For a given parameter minsupport, an itemset X is called a frequent itemset if support(X) ≥ minsupport. The set of all frequent itemsets is denoted by F.
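The definitions above can be made concrete with a small brute-force sketch. It is for illustration only: it enumerates every candidate itemset, which is exponential in the number of items, and it is not how AIM-F works. The function names and the toy database are hypothetical.

```python
from itertools import combinations


def support(itemset, database):
    """Number of transactions T in the database with itemset ⊆ T."""
    x = frozenset(itemset)
    return sum(1 for transaction in database if x <= transaction)


def frequent_itemsets(database, minsupport):
    """Brute-force enumeration of F (exponential; illustration only)."""
    items = sorted(set().union(*database))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, database)
            if s >= minsupport:
                result[frozenset(candidate)] = s
    return result


# Toy database D with four transactions over the items {1, 2, 3, 4}.
D = [frozenset(t) for t in ([1, 2, 3], [1, 2], [2, 3], [1, 2, 3, 4])]
F = frequent_itemsets(D, minsupport=3)
```

With minsupport = 3, the itemset {1, 2} is frequent (it occurs in three transactions), while {1, 3} is not (it occurs in only two).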

We presented AIM-F [4] for mining frequent itemsets. The AIM-F algorithm builds upon several ideas appearing in previous work, a partial list of which is the following: Apriori [2], Lexicographic Trees and Depth First Search Traversal [6], Dynamic Reordering [5], Vertical Bit Vectors [7, 3], Projection [3], Difference Sets [9], Parent Equivalence Pruning [3, 8], and Bit-vector Projection [3].
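Two of the ideas listed above, vertical bit vectors [7, 3] and difference sets [9], can be illustrated with a minimal sketch. It assumes Python integers as bit vectors, with bit i set when transaction i contains the item. This is an illustrative choice, not the compressed representation used by AIM-F, which is not detailed here. All names are hypothetical.

```python
def vertical_bitvectors(database):
    """Map each item to an int whose bit i is set iff transaction i contains it."""
    vectors = {}
    for tid, transaction in enumerate(database):
        for item in transaction:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


D = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
v = vertical_bitvectors(D)

# Support of {1, 3} by intersecting the two bit vectors (bitwise AND).
t_1 = v[1]
t_13 = v[1] & v[3]
support_13 = popcount(t_13)

# The same support via a diffset: the transactions that contain {1}
# but not {1, 3}. Then support({1, 3}) = support({1}) - |diffset|.
diffset_13 = t_1 & ~v[3]
support_13_via_diffset = popcount(t_1) - popcount(diffset_13)
```

Both routes give the same support. The diffset is attractive when it is much smaller than the intersection, which tends to happen on dense datasets.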

High-level pseudocode for the AIM-F algorithm appears in Figure 1.
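As a runnable companion to Figure 1, the following is a minimal Python sketch of the same depth-first search with parent equivalence pruning (PEP) and dynamic reordering. It is a sketch under stated assumptions, not the AIM-F implementation: transaction-id sets stand in for the compressed vertical database, and the names `aim_f` and `mine` are hypothetical. One deliberate difference from Figure 1 is that each node reports its own head on entry, once its PEP items are known, rather than the parent reporting the child's head inside the loop. This is one reading of how the PEP subsets are meant to be enumerated.

```python
from itertools import combinations


def aim_f(head, head_tids, tail, tidsets, minsupport, pep, out):
    """Visit one node of the lexicographic tree (assumed layout, see lead-in).

    head      -- tuple of items in the current node
    head_tids -- ids of the transactions that contain every item of head
    tail      -- items that may still extend head
    pep       -- items removed by PEP on the path to this node
    out       -- dict collecting frequent itemset -> support
    """
    pep = list(pep)
    kept = []
    for item in tail:
        s = len(head_tids & tidsets[item])
        if s == len(head_tids):
            pep.append(item)            # PEP: item occurs wherever head occurs
        elif s >= minsupport:
            kept.append((s, item))      # frequent extension, keep in the tail
        # otherwise the extension is infrequent and the item is dropped

    # Report head combined with every subset of the PEP items.
    for k in range(len(pep) + 1):
        for extra in combinations(pep, k):
            itemset = frozenset(head) | frozenset(extra)
            if itemset:                 # skip the empty itemset at the root
                out[itemset] = len(head_tids)

    # Dynamic reordering: least frequent extensions first.
    kept.sort(key=lambda pair: pair[0])
    order = [item for _, item in kept]
    for i, item in enumerate(order):
        aim_f(head + (item,), head_tids & tidsets[item], order[i + 1:],
              tidsets, minsupport, pep, out)


def mine(database, minsupport):
    """Build the vertical representation and start the search at the root."""
    out = {}
    if len(database) < minsupport:
        return out                      # nothing can be frequent
    tidsets = {}
    for tid, transaction in enumerate(database):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)
    all_tids = set(range(len(database)))
    aim_f((), all_tids, list(tidsets), tidsets, minsupport, [], out)
    return out


result = mine([{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}], minsupport=3)
```

In this toy run, item 2 occurs in every transaction, so it is removed by PEP at the root and never becomes part of a head. The itemsets {1, 2} and {2, 3} are produced by combining the heads {1} and {3} with the PEP item.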

    AIM-F(n : node, minsupport : integer)
        t = n.tail
        for each α in t
            Compute s_α = support(n.head ∪ α)
            if (s_α = support(n.head))
                add α to the list of items removed by PEP
                remove α from t
            else if (s_α < minsupport)
                remove α from t
        Sort items in t by s_α in ascending order.
        While t ≠ ∅
            Let α be the first item in t
            remove α from t
            n'.head = n.head ∪ α
            n'.tail = t
            Report n'.head ∪ {All subsets of items removed by PEP} as frequent itemsets
            AIM-F(n', minsupport)

Figure 1: High-level pseudocode for AIM-F.

2. Implementation Improvements

We now describe the differences between the AIM-F and AIM2-F implementations:

• Integer-to-string conversions. Run-time analysis of our experiments showed that converting integers to strings is a major CPU consumer. To reduce conversion time, two steps are taken:

  - Item name conversion. When an itemset is printed, the names of all its items are printed. In this mining task the items are numbers and need to be converted to strings. Instead of creating the string every time before printing, the conversion is done once per item, when the item is loaded while reading the dataset.

  - Support conversion. To print the support, it must be converted to a string. To make this conversion fast, a static lookup table from integer to string was added. The lookup table contains the 64K integer values above minsupport, and each entry stores the string representation of its value. Whenever a support value needs to be converted to a string, the table is checked first; if the value is present, the string is taken from the table at very low cost.

  In Figures 2 and 3 we compare the AIM2-F runtime with and without the string conversion improvement. This improvement alone contributes up to an order of magnitude speedup, and its contribution grows as the support threshold is lowered.

• Late F2 matrix construction. The size of the F2 matrix is |I|², where |I| is the number of items. In datasets where the number of items is very large, the F2 matrix cannot be constructed. In AIM2-F, the F2 matrix is built only for items i for which support(i) ≥ minsupport. This enables the construction of the F2 matrix for larger datasets.

• Input buffer reuse. In AIM-F, the dataset load method allocated an input buffer for every transaction read. Switching to a single input buffer that is reused for all transactions reduced the loading time in AIM2-F by nearly 50%. However, the loading time is usually insignificant compared to the overall runtime (unless the support is very high).
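The two integer-to-string steps described in the first item above can be sketched as follows. This is a minimal illustration under assumptions, not the AIM2-F code: the class and function names are hypothetical, the table is assumed to start at minsupport itself (the text does not say whether the window starts at minsupport or just above it), and the "items (support)" output format is assumed.

```python
class SupportFormatter:
    """Caches decimal strings for a 64K window of support values."""

    TABLE_SIZE = 64 * 1024

    def __init__(self, minsupport):
        # Assumption: the window covers minsupport .. minsupport + 64K - 1.
        self.base = minsupport
        self.table = [str(minsupport + i) for i in range(self.TABLE_SIZE)]

    def to_str(self, support):
        offset = support - self.base
        if 0 <= offset < self.TABLE_SIZE:
            return self.table[offset]   # table hit: no conversion needed
        return str(support)             # outside the window: convert normally


def load_item_names(database):
    """Convert each distinct item to its string once, while loading."""
    names = {}
    for transaction in database:
        for item in transaction:
            if item not in names:
                names[item] = str(item)
    return names


def format_itemset(itemset, support, names, formatter):
    """Build one output line from cached strings only (assumed format)."""
    items = " ".join(names[i] for i in itemset)
    return items + " (" + formatter.to_str(support) + ")"


names = load_item_names([[1, 2, 3], [2, 7]])
formatter = SupportFormatter(minsupport=3)
line = format_itemset((1, 2), 3, names, formatter)
```

The design point is that the number of distinct strings needed is small compared to the number of itemsets printed, so each conversion is paid once and reused many times.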

References

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207-216, 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, pages 487-499, 1994.

[3] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: a maximal frequent itemset algorithm for transactional databases. In ICDE, 2001.

[4] A. Fiat and S. Shporer. AIM: Another itemset miner. In FIMI, 2003.

[5] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In SIGMOD, pages 85-93, 1998.

[6] R. Rymon. Search through systematic set enumeration. In KR-92, pages 539-550, 1992.

[7] P. Shenoy, J. R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In SIGMOD, 2000.

[8] M. J. Zaki. Scalable algorithms for association mining. Knowledge and Data Engineering, 12(2):372-390, 2000.

[9] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. In KDD, pages 326-335, 2003.