=Paper= {{Paper |id=Vol-126/paper-9 |storemode=property |title=AIM2: Improved implementation of AIM |pdfUrl=https://ceur-ws.org/Vol-126/shporer.pdf |volume=Vol-126 |dblpUrl=https://dblp.org/rec/conf/fimi/Shporer04 }} ==AIM2: Improved implementation of AIM== https://ceur-ws.org/Vol-126/shporer.pdf
                         AIM2: Improved implementation of AIM

                                                   Sagi Shporer
                                            School of Computer Science
                                                Tel-Aviv University
                                                  Tel Aviv, Israel
                                                 shporer@tau.ac.il


                       Abstract                                sets [9], Dynamic Reordering [5], Parent Equivalence
                                                               Pruning [3, 8] and Bit-vector projection [3].
    We present AIM2-F, an improved implementation                 High level pseudo code for the AIM-F algorithm ap-
of AIM-F [4] algorithm for mining frequent itemsets.           pears in Figure 1.
Past studies have proposed various algorithms and tech-
niques for improving the efficiency of the mining task.         AIM-F(n : node, minsupport : integer)
We have presented AIM-F at FIMI’03, a combina-                 (1) t = n.tail
tion of some techniques into an algorithm which uti-           (2) for each α in t                      S
lize those techniques dynamically according to the in-         (3)     Compute sα = support(n.head α)
put dataset. The algorithm main features include depth         (4)     if (sα = support(n.head))
first search with vertical compressed database, diffset,       (5)          add α to the list of items removed by PEP
parent equivalence pruning, dynamic reordering and             (6)          remove α from t
projection. Experimental testing suggests that AIM2-           (7)     else if (sα < minsupport)
F outperforms existing algorithm implementations on            (8)          remove α from t
various datasets.                                              (9) Sort items in t by sα in ascending order.
                                                               (10) While t 6= ∅
                                                               (11)    Let α be the first item in t
                                                               (12)    remove α from t S
1. Introduction                                                (13)    n0 .head = n.head α
                                                               (14)    n0 .tail = t    S
    Finding association rules is one of the driving ap-        (15)    Report n0 .head {All subsets of items
plications in data mining, and much research has been                      removed by PEP} as frequent itemsets
done in this field [7, 3, 5]. Using the support-confidence     (16)    AIM-F(n0 )
framework, proposed in the seminal paper of [1], the
problem is split into two parts — (a) finding frequent
itemsets, and (b) generating association rules.
    Let I be a set of items. A subset X ⊆ I is called                            Figure 1. AIM-F
an itemset. Let D be a transactional database, where
each transaction T ∈ D is a subset of I : T ⊆ I. For an
itemset X, support(X) is defined to be the number of           2. Implementation Improvements
transactions T for which X ⊆ T . For a given parameter
minsupport, an itemset X is call a frequent itemset              We now describe the difference between AIM-F and
if support(X) ≥ minsupport. The set of all frequent            AIM2-F implementations:
itemsets is denoted by F.
                                                                 • Integer to String conversions - Experiments run
    We have presented AIM-F [4] for mining frequent
                                                                   time analysis have shown that the conversion of
itemsets. The AIM-F algorithm build upon several
                                                                   integers to strings is a major CPU consumer. To
ideas appearing in previous work, a partial list of which
                                                                   reduce conversion time two steps are taken:
is the following: Apriori [2], Lexicographic Trees and
Depth First Search Traversal [6], Dynamic Reordering                 – Item name conversion - When printing an
[5], Vertical Bit Vectors [7, 3], Projection [3], Difference           itemset all the item names in the itemset are
Figure 2. Connect dataset: Testing AIM2-F                     Figure 3. Chess dataset: Testing AIM2-F with
with and without the string conversion im-                    and without the string conversion improve-
provements                                                    ments


       printed. In this mining task the items are               minSupport. This enables the construction of the
       numbers, and need to be converted to strings.            F 2 for larger datasets.
       Instead of creating the string every time be-          • Input buffer reuse - In AIM-F the dataset load
       fore printing, the conversion is done once for           method allocated an input buffer for every trans-
       every item, when the item is loaded during               action read. Switching to a single input buffer
       the dataset reading process.                             that is re-used for all the transactions reduced the
    – Support conversion - To print the support it              loading time in AIM2-F by nearly 50%. However
      must be converted to a string. To enable fast             the loading time is usually insignificant comparing
      conversion of the support value to string, a              to the overall runtime (unless the support is very
      static lookup table from integer to string was            high).
      added. The lookup table contains the 64K
      integer values above the minSupport. Every            References
      entry in the lookup table has the string repre-
      sentation of the entry attached. Every time a         [1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining as-
      support value needs to be converted to string,            sociation rules between sets of items in large databases.
      it is first checked if the value appears in the           In SIGMOD, pages 207–216, 1993.
      lookup table, if so, the string is taken from         [2] R. Agrawal and R. Srikant. Fast algorithms for mining
      the table, with a very low cost.                          association rules. In VLDB, pages 487–499, 1994.
                                                            [3] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: a
                                                                maximal frequent itemset algorithm for transactional
  In figures 2 and 3 we compare the AIM2-F al-
                                                                databases. In ICDE, 2001.
  gorithm runtime with and without the string con-          [4] A. Fiat and S. Shporer. Aim: Another itemset miner.
  version improvement. It is clear that this improve-           In FIMI, 2003.
  ment alone contribute up to an order of magnitude         [5] R. J. B. Jr. Efficiently mining long patterns from
  improvement. As the size of the input increases               databases. In SIGMOD, pages 85–93, 1998.
  (lower support) the contribution of the string con-       [6] R. Rymon. Search through systematic set enumeration.
  version improvements increases.                               In KR-92, pages 539–550, 1992.
                                                            [7] P. Shenoy, J. R. Haritsa, S. Sundarshan, G. Bhalotia,
• Late F 2 matrix construction - The size of the                M. Bawa, and D. Shah. Turbo-charging vertical mining
  F 2 matrix is I 2 where I is the number of items.             of large databases. In SIGMOD, 2000.
                                                            [8] M. J. Zaki. Scalable algorithms for association mining.
  In datasets where the number of items is very
                                                                Knowledge and Data Engineering, 12(2):372–390, 2000.
  large the F 2 matrix can not be constructed. The          [9] M. J. Zaki and K. Gouda. Fast vertical mining using
  improvement in AIM2-F is that the F 2 matrix                  diffsets. In KDD, pages 326–335, 2003.
  is built only for items for which support(i) ≥

                                                        2