ARMOR: Association Rule Mining based on ORacle

Vikram Pudi, Intl. Inst. of Information Technology, Hyderabad, India (vikram@iiit.net)
Jayant R. Haritsa, Database Systems Lab, SERC, Indian Institute of Science, Bangalore, India (haritsa@dsl.serc.iisc.ernet.in)

Abstract

In this paper, we first focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our strategy is to compare their performance against an "Oracle algorithm" that knows in advance the identities of all frequent itemsets in the database and only needs to gather their actual supports to complete the mining process. Our experimental results show that current mining algorithms do not perform uniformly well with respect to the Oracle for all database characteristics and support thresholds. In many cases there is a substantial gap between the Oracle's performance and that of the current mining algorithms. Second, we present a new mining algorithm, called ARMOR, that is constructed by making minimal changes to the Oracle algorithm. ARMOR consistently performs within a factor of two of the Oracle on both real and synthetic datasets over practical ranges of support specifications.

1 Introduction

We focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our approach is to compare their performance against an "Oracle algorithm" that knows in advance the identities of all frequent itemsets in the database and only needs to gather the actual supports of these itemsets to complete the mining process. Clearly, any practical algorithm will have to do at least this much work in order to generate mining rules. This "Oracle approach" permits us to clearly demarcate the maximal space available for performance improvement over the currently available algorithms. Further, it enables us to construct new mining algorithms from a completely different perspective, namely, as minimally-altered derivatives of the Oracle.

First, we show that while the notion of the Oracle is conceptually simple, its construction is not equally straightforward. In particular, it is critically dependent on the choice of data structures used during the counting process. We present a carefully engineered implementation of Oracle that makes the best choices for these design parameters at each stage of the counting process. Our experimental results show that there is a considerable gap in the performance between the Oracle and existing mining algorithms.

Second, we present a new mining algorithm, called ARMOR (Association Rule Mining based on ORacle), whose structure is derived by making minimal changes to the Oracle, and which is guaranteed to complete in two passes over the database. Although ARMOR is derived from the Oracle, it may be seen to share the positive features of a variety of previous algorithms such as PARTITION [9], CARMA [5], AS-CPA [6], VIPER [10] and DELTA [7]. Our empirical study shows that ARMOR consistently performs within a factor of two of the Oracle, over both real (BMS-WebView-1 [14] from Blue Martini Software) and synthetic databases (from the IBM Almaden generator [2]), over practical ranges of support specifications.

The environment we consider is one where the pattern lengths in the database are small enough that the size of the mining results is comparable to the available main memory. This holds when the mined data conforms to the sparse nature of market basket data, for which association rule mining was originally intended. It is perhaps inappropriate to apply the problem of mining all frequent itemsets to dense datasets.

For ease of exposition, we will use the notation shown in Table 1 in the remainder of this paper.

Table 1. Notation
  D        Database of customer purchase transactions
  minsup   User-specified minimum rule support
  F        Set of frequent itemsets in D
  N        Set of itemsets in the negative border of F
  P1..Pn   Set of disjoint partitions of D
  d        Transactions in partitions scanned so far during algorithm execution, excluding the current partition
  d+       Transactions in partitions scanned so far during algorithm execution, including the current partition
  G        DAG structure to store candidates

Organization. The remainder of this paper is organized as follows: The design of the Oracle algorithm is described in Section 2 and is used to evaluate the performance of current algorithms in Section 3. Our new ARMOR algorithm is presented in Section 4 and its main memory requirements are discussed in Section 5. The performance of ARMOR is evaluated in Section 6. Finally, in Section 7, we summarize the conclusions of our study.

2 The Oracle Algorithm

In this section we present the Oracle algorithm which, as mentioned in the Introduction, "magically" knows in advance the identities of all frequent itemsets in the database and only needs to gather the actual supports of these itemsets. Clearly, any practical algorithm will have to do at least this much work in order to generate mining rules. Oracle takes as input the database D in item-list format (which is organized as a set of rows, with each row storing an ordered list of item-identifiers (IIDs) representing the items purchased in the transaction), the set of frequent itemsets F, and its corresponding negative border N, and outputs the supports of these itemsets by making one scan over the database. We first describe the mechanics of the Oracle algorithm below and then move on to discuss the rationale behind its design choices in Section 2.2.

2.1 The Mechanics of Oracle

For ease of exposition, we first present the manner in which Oracle computes the supports of 1-itemsets and 2-itemsets, and then move on to longer itemsets. Note, however, that the algorithm actually performs all these computations concurrently in one scan over the database.

2.1.1 Counting Singletons and Pairs

Data-Structure Description. The counters of singletons (1-itemsets) are maintained in a one-dimensional lookup array, A1, and those of pairs (2-itemsets) in a lower triangular two-dimensional lookup array, A2 (similar arrays are also used in Apriori [2, 11] for its first two passes). The i-th entry in the array A1 contains two fields: (1) count, the counter for the itemset X corresponding to the i-th item, and (2) index, the number of frequent itemsets prior to X in A1 if X is frequent; null, otherwise.

Algorithm Description. The ArrayCount function shown in Figure 1 takes as inputs a transaction T along with A1 and A2, and updates the counters of these arrays over T. In the ArrayCount function, the individual items in the transaction are enumerated (lines 2-5) and, for each item, its corresponding count in A1 is incremented (line 3). During this process, the frequent items in T are stored in a separate itemset Tf (line 5). We then enumerate all pairs of items contained in Tf (lines 6-10) and increment the counters of the corresponding 2-itemsets in A2 (lines 8-10).

ArrayCount (T, A1, A2)
Input: Transaction T, 1-itemsets Array A1, 2-itemsets Array A2
Output: Arrays A1 and A2 with their counts updated over T
1.  Itemset Tf = null;  // to store frequent items from T in Item-List format
2.  for each item i in transaction T
3.      A1[i].count++;
4.      if A1[i].index != null
5.          append i to Tf
6.  for j = 1 to |Tf|  // enumerate 2-itemsets
7.      for k = j + 1 to |Tf|
8.          index1 = A1[Tf[j]].index  // row index
9.          index2 = A1[Tf[k]].index  // column index
10.         A2[index1][index2]++;

Figure 1. Counting Singletons and Pairs in Oracle

2.1.2 Counting k-itemsets, k > 2

Data-Structure Description. Itemsets in F ∪ N of length greater than 2, and their related information (counters, etc.), are stored in a DAG structure G, which is pictorially shown in Figure 2 for a database with items A, B, C, D. Although singletons and pairs are stored in lookup arrays, as mentioned before, for expository ease we assume that they too are stored in G in the remainder of this discussion. Each itemset is stored in a separate node of G and is linked to the first two (in a lexicographic ordering) of its subsets. We use the terms "mother" and "father" of an itemset to refer to the (lexicographically) smaller subset and the (lexicographically) larger subset, respectively. E.g., {A, B} and {A, C} are the mother and father, respectively, of {A, B, C}. For each itemset X in G, we also store with it links to those supersets of X for which X is a mother. We call this list of links the childset of X.

Since each itemset is stored in a separate node in the DAG, we use the terms "itemset" and "node" interchangeably in the remainder of this discussion. Also, we use C to denote the set of itemsets that are stored in the DAG structure G.

[Figure 2. DAG Structure Containing the Power Set of {A, B, C, D}: each node is linked to its mother (e.g., ABC to AB) and to its father (e.g., ABC to AC); the mother links form a spanning tree of the DAG.]

Algorithm Description. We use a partitioning [9] scheme wherein the database is logically divided into n disjoint horizontal partitions P1, P2, ..., Pn. In this scheme, itemsets being counted are enumerated only at the end of each partition, and not after every tuple. Each partition is as large as can fit in available main memory. For ease of exposition, we assume that the partitions are equi-sized.

Oracle (D, C)
Input: Database D, Itemsets to be Counted C = F ∪ N
Output: Itemsets in C with Supports
1.  n = Number of Partitions
2.  for i = 1 to n
3.      ReadNextPartition(Pi, G);
4.      for each singleton X in G
5.          Update(X);

Figure 3. The Oracle Algorithm

Update (M)
Input: DAG Node M
Output: M and its Descendants with Counts Updated
1.  tv = convert M.tidlist to Tid-vector format  // tv is statically allocated
2.  for each node X in M.childset
3.      X.tidlist = Intersect(tv, X.father.tidlist);
4.      X.count += |X.tidlist|
5.  for each node X in M.childset
6.      Update(X);

Figure 4. Updating Counts
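To make the ArrayCount mechanics concrete, here is a small illustrative sketch (our own Python rendering, not the paper's C++ implementation; the names make_counters, a1, a2 and the dictionary-based entries are ours): singletons are counted in a flat array whose entries carry the count and index fields described above, and pairs of frequent items are counted in a lower triangular array addressed by their index values.

```python
def make_counters(num_items, frequent_items):
    """Build A1 and A2. a1[i] holds count and index, where index is the
    number of frequent items prior to i (i.e., its rank), or None."""
    freq_index = {item: rank for rank, item in enumerate(sorted(frequent_items))}
    a1 = [{"count": 0, "index": freq_index.get(i)} for i in range(num_items)]
    nf = len(freq_index)
    # Lower-triangular pair-counter array: one slot per pair of frequent items.
    a2 = [[0] * nf for _ in range(nf)]
    return a1, a2

def array_count(transaction, a1, a2):
    """Update singleton and pair counters over one transaction (cf. Figure 1)."""
    tf = []  # frequent items seen in this transaction
    for item in transaction:
        a1[item]["count"] += 1
        if a1[item]["index"] is not None:
            tf.append(item)
    for j in range(len(tf)):
        for k in range(j + 1, len(tf)):
            r = a1[tf[j]]["index"]  # row index
            c = a1[tf[k]]["index"]  # column index
            a2[max(r, c)][min(r, c)] += 1  # store in the lower triangle
```

Counting pairs only among the frequent items of a transaction is what keeps the triangular array small relative to the full item universe.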
However, we hasten to add that the technique is easily extendible to arbitrary partition sizes.

The pseudo-code of Oracle is shown in Figure 3 and operates as follows: The ReadNextPartition function (line 3) reads tuples from the next partition and simultaneously creates tid-lists (within that partition) of the singleton itemsets in G; a tid-list of an itemset is an ordered list of the TIDs of the transactions that contain it. Note that this conversion of the database to the tid-list (TL) format is an on-the-fly operation and does not change the complexity of Oracle by more than a (small) constant factor. Next, the Update function (line 5) is applied on each singleton in G. This function takes a node M in G as input and updates the counts of all descendants of M to reflect their counts over the current partition. The count of any itemset within a partition is equal to the length of its corresponding tidlist (within that partition). The tidlist of an itemset can be obtained as the intersection of the tidlists of its mother and father, and this process is started off using the tidlists of the frequent 1-itemsets. The exact details of tidlist computation are discussed later.

We now describe the manner in which the itemsets in G are enumerated after reading in a new partition. The set of childset links induces a spanning tree of G (e.g., consider only the solid edges in Figure 2). We perform a depth first search on this spanning tree to enumerate all its itemsets. When a node in the tree is visited, we compute the tidlists of all its children. This ensures that when an itemset is visited, the tidlists of its mother and father have already been computed.

The above processing is captured in the function Update, whose pseudo-code is shown in Figure 4. Here, the tidlist of a given node M is first converted to the tid-vector (TV) format (line 1); a tid-vector of an itemset is a bit-vector of 1's and 0's representing the presence or absence, respectively, of the itemset in each transaction. Then, the tidlists of all children of M are computed (lines 2-4), after which the same children are visited in a depth first search (lines 5-6).

The mechanics of tidlist computation, as promised earlier, are given in Figure 5.

Intersect (tv, T)
Input: Tid-vector tv, Tid-list T
Output: Intersection of tv and T
1.  Tid-list result = empty
2.  for each tid in T
3.      offset = tid - (tid of first transaction in current partition)
4.      if tv[offset] = 1 then
5.          add tid to result
6.  return result

Figure 5. Intersection
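The counting machinery of Figures 3-5 can be sketched as follows (an illustrative Python rendering under our own naming; the paper's implementation is in C++ and statically allocates the tid-vector): each DAG node carries mother/father/childset links, intersect performs Figure 5's position-based bit test, and update computes the children's tidlists before recursing depth-first.

```python
class Node:
    """One itemset in the DAG G, with mother/father subset links."""
    def __init__(self, itemset, mother=None, father=None):
        self.itemset = itemset
        self.mother, self.father = mother, father
        self.childset = []  # supersets for which this node is the mother
        self.tidlist = []
        self.count = 0

def intersect(tv, tidlist, first_tid):
    """Figure 5: keep each tid whose bit is set in the tid-vector tv."""
    return [tid for tid in tidlist if tv[tid - first_tid]]

def update(node, partition_size, first_tid):
    """Figure 4: compute children's tidlists from the mother's tid-vector
    and the father's tidlist, accumulate counts, then recurse depth-first."""
    tv = [0] * partition_size  # tid-vector of `node` over this partition
    for tid in node.tidlist:
        tv[tid - first_tid] = 1
    for child in node.childset:
        child.tidlist = intersect(tv, child.father.tidlist, first_tid)
        child.count += len(child.tidlist)
    for child in node.childset:
        update(child, partition_size, first_tid)
```

Because children are expanded only after their mother is visited, every node's mother and father tidlists exist by the time the node itself is processed, exactly as the spanning-tree argument above requires.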
The Intersect function shown in Figure 5 takes as input a tid-vector tv and a tid-list T. Each tid in T is added to the result if tv[offset] is 1 (lines 2-5), where offset, computed in line 3, represents the position of the transaction relative to the current partition.

2.2 Rationale for the Oracle Design

We show that Oracle is optimal in two respects: (1) it enumerates only those itemsets in G that need to be enumerated, and (2) the enumeration is performed in the most efficient way possible. These results are based on the following two theorems. Due to lack of space, we have deferred the proofs of the theorems to [8].

Theorem 2.1. If the size of each partition is large enough that every itemset in F ∪ N of length greater than 2 is present at least once in it, then the only itemsets being enumerated in the Oracle algorithm are those whose counts need to be incremented in that partition.

Theorem 2.2. The cost of enumerating each itemset in Oracle is Θ(1), with a tight constant factor.

While Oracle is optimal in these respects, we note that there may remain some scope for improvement in the details of tidlist computation. That is, the Intersect function (Figure 5), which computes the intersection of a tid-vector tv and a tid-list T, requires |T| operations. tv itself was originally constructed from a tid-list, although this cost is amortized over many calls to the Intersect function. We plan to investigate in our future work whether the intersection of two sets can, in general, be computed more efficiently - for example, using diffsets, a novel and interesting approach suggested in [13]. The diffset of an itemset X is the set-difference of the tid-list of X from that of its mother. Diffsets can be easily incorporated in Oracle - only the Update function in Figure 4 of Section 2 needs to be changed to compute diffsets instead of tidlists, following the techniques suggested in [13].

Advantages of Partitioning Schemes. Oracle, as discussed above, uses a partitioning scheme. An alternative commonly used in current association rule mining algorithms, especially in hashtree [2] based schemes, is the tuple-by-tuple approach. A problem with the tuple-by-tuple approach, however, is that there is considerable wasted enumeration of itemsets. The core operation in these algorithms is to determine all candidates that are subsets of the current transaction. Given that a frequent itemset X is present in the current transaction, we need to determine all candidates that are immediate supersets of X and are also present in the current transaction. In order to achieve this, it is often necessary to enumerate and check for the presence of many more candidates than those that are actually present in the current transaction.

3 Performance Study

In the previous section, we described the Oracle algorithm. In order to assess the performance of current mining algorithms with respect to the Oracle algorithm, we have chosen VIPER [10] and FP-growth [4], among the latest in the suite of online mining algorithms. For completeness and as a reference point, we have also included the classical Apriori in our evaluation suite.

Our experiments cover a range of database and mining workloads, and include the typical and extreme cases considered in previous studies - the only difference is that we also consider database sizes that are significantly larger than the available main memory. The performance metric in all the experiments is the total execution time taken by the mining operation.

The databases used in our experiments were synthetically generated using the technique described in [2] and attempt to mimic the customer purchase behavior seen in retailing environments. The parameters used in the synthetic generator and their default values are described in Table 2. In particular, we consider databases with parameters T10.I4, T20.I12 and T40.I8, with 10 million tuples in each of them.

Table 2. Parameter Table
  Parameter   Meaning                                        Default Values
  N           Number of items                                1000
  |T|         Mean transaction length                        10, 20, 40
  |L|         Number of potentially frequent itemsets        2000
  |I|         Mean length of potentially frequent itemsets   4, 8, 12
  |D|         Number of transactions in the database         10M

We set the rule support threshold values to as low as was feasible with the available main memory. At these low support values, the number of frequent itemsets exceeded twenty-five thousand! Beyond this, we felt that the number of rules generated would be enormous and the purpose of mining - to find interesting patterns - would not be served. In particular, we set the rule support threshold values for the T10.I4, T20.I12 and T40.I8 databases to the ranges (0.1%-2%), (0.4%-2%) and (1.15%-5%), respectively.

Our experiments were conducted on a 700-MHz Pentium III workstation running Red Hat Linux 6.2, configured with 512 MB main memory and a local 18 GB SCSI 10000 rpm disk.
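To illustrate the diffset idea (a minimal sketch of the approach from [13], with our own function names, not the authors' code): the diffset of an itemset is formed from its mother's tidlist, and the itemset's support then follows by subtraction rather than by materializing the full tidlist.

```python
def diffset(mother_tidlist, tidlist):
    """Tids present in the mother's tidlist but missing from the itemset's own."""
    present = set(tidlist)
    return [tid for tid in mother_tidlist if tid not in present]

def support_from_diffset(mother_support, dset):
    """support(X) = support(mother(X)) - |diffset(X)|."""
    return mother_support - len(dset)
```

Diffsets pay off when an itemset's tidlist is nearly as long as its mother's, since the stored difference is then short.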
For the T10.I4, T20.I12 and T40.I8 databases, the associated database sizes were approximately 500 MB, 900 MB and 1.7 GB, respectively. All the algorithms in our evaluation suite are written in C++. We implemented a basic version of the FP-growth algorithm (the original implementation by Han et al. was not available) wherein we assume that the entire FP-tree data structure fits in main memory. Finally, the partition size in Oracle was fixed to be 20K tuples.

3.1 Experimental Results for Current Mining Algorithms

We now report on our experimental results. We conducted two experiments to evaluate the performance of current mining algorithms with respect to the Oracle. Our first experiment was run on large (10M tuples) databases, while our second experiment was run on small (100K tuples) databases.

3.1.1 Experiment 1: Performance of Current Algorithms

In our first experiment, we evaluated the performance of the Apriori, VIPER and Oracle algorithms for the T10.I4, T20.I12 and T40.I8 databases, each containing 10M transactions; these results are shown in Figures 6a-c. The x-axis in these graphs represents the support threshold values, while the y-axis represents the response times of the algorithms being evaluated.

In these graphs, we see that the response times of all algorithms increase exponentially as the support threshold is reduced. This is only to be expected, since the number of itemsets in the output, |F ∪ N|, increases exponentially with decrease in the support threshold.

We also see that there is a considerable gap in the performance of both Apriori and VIPER with respect to Oracle. For example, in Figure 6a, at a support threshold of 0.1%, the response time of VIPER is more than 6 times that of Oracle, whereas the response time of Apriori is more than 26 times!

In this experiment, we could not evaluate the performance of FP-growth because it did not complete in any of our runs on large databases, due to its heavy and database-size-dependent utilization of main memory. The reason for this is that FP-growth stores the database itself in a condensed representation in a data structure called the FP-tree. In [4], the authors briefly discuss the issue of constructing disk-resident FP-trees. We, however, did not take this into account in our implementation.

3.1.2 Experiment 2: Small Databases

Since, as mentioned above, it was not possible for us to evaluate the performance of FP-growth on large databases due to its heavy utilization of main memory, we evaluated the performance of FP-growth and the other current algorithms on small databases consisting of 100K transactions. The results of this experiment are shown in Figures 7a-c, which correspond to the T10.I4, T20.I12 and T40.I8 databases, respectively.

In these graphs, we see that there continues to be a considerable gap in the performance of current mining algorithms with respect to Oracle. For example, for the T40.I8 database, the response time of FP-growth is more than 8 times that of Oracle for the entire support threshold range.

4 The ARMOR Algorithm

In the previous section, our experimental results have shown that there is a considerable gap in the performance between the Oracle and existing mining algorithms. We now move on to describe our new mining algorithm, ARMOR (Association Rule Mining based on ORacle). In this section, we overview the main features and the flow of execution of ARMOR - the details of candidate generation are deferred to [8] due to lack of space.

ARMOR (D, I, minsup)
Input: Database D, Set of Items I, Minimum Support minsup
Output: F ∪ N with Supports
1.  n = Number of Partitions
    // ----- First Pass -----
2.  C = I  // candidate set (in a DAG)
3.  for i = 1 to n
4.      ReadNextPartition(Pi, G);
5.      for each singleton X in G
6.          X.count += |X.tidlist|
7.          Update1(X, minsup);
    // ----- Second Pass -----
8.  RemoveSmall(C, minsup);
9.  OutputFinished(C, minsup);
10. for i = 1 to n
11.     if (all candidates in C have been output)
12.         exit
13.     ReadNextPartition(Pi, G);
14.     for each singleton X in G
15.         Update2(X, minsup);

Figure 8. The ARMOR Algorithm
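The two-pass control flow of Figure 8 can be illustrated with a toy Python sketch. This is a drastic simplification under our own assumptions: candidates are single items only, candidate generation happens at partition boundaries, and the Update1/Update2 and RemoveSmall internals (detailed in [8]) are reduced to plain counting and a count + maxMissed pruning test.

```python
def armor_toy(partitions, minsup_count):
    """Two-pass partitioned counting: pass 1 builds candidates with partial
    counts, pass 2 completes counts over the partitions each candidate missed."""
    cand = {}  # item -> {"count", "first_partition", "max_missed"}
    # --- First pass: count existing candidates; insert new ones at the
    # partition boundary (as in ARMOR, not after every tuple).
    for i, part in enumerate(partitions):
        seen = set()
        for txn in part:
            for item in txn:
                if item in cand:
                    cand[item]["count"] += 1
                else:
                    seen.add(item)
        missed = sum(len(p) for p in partitions[: i + 1])  # tuples scanned so far
        for item in seen:
            cand[item] = {"count": 0, "first_partition": i + 1, "max_missed": missed}
    # RemoveSmall: drop candidates that cannot reach minsup even in the best case.
    cand = {x: f for x, f in cand.items()
            if f["count"] + f["max_missed"] >= minsup_count}
    # --- Second pass: re-count each candidate over the partitions it missed.
    for i, part in enumerate(partitions):
        if all(f["first_partition"] <= i for f in cand.values()):
            break  # no candidate missed this or any later partition: stop early
        for txn in part:
            for item in txn:
                f = cand.get(item)
                if f is not None and f["first_partition"] > i:
                    f["count"] += 1
    return {x: f["count"] for x, f in cand.items() if f["count"] >= minsup_count}
```

The early exit in the second pass mirrors lines 11-12 of Figure 8: once every remaining candidate's missing prefix of the database has been re-read, scanning stops.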
The guiding principle in our design of the ARMOR algorithm is that we consciously make an attempt to determine the minimal amount of change to Oracle required to result in an online algorithm. This is in marked contrast to the earlier approaches, which designed new algorithms by trying to address the limitations of previous online algorithms. That is, we approach the association rule mining problem from a completely different perspective.

[Figure 6. Performance of Current Algorithms (Large Databases): response time (seconds) versus support (%) for Oracle, Apriori and VIPER on (a) T10.I4.D10M, (b) T20.I12.D10M, (c) T40.I8.D10M.]

[Figure 7. Performance of Current Algorithms (Small Databases): response time (seconds) versus support (%) for Oracle, Apriori, VIPER and FP-growth on (a) T10.I4.D100K, (b) T20.I12.D100K, (c) T40.I8.D100K.]

In ARMOR, as in Oracle, the database is conceptually partitioned into n disjoint blocks P1, P2, ..., Pn. At most two passes are made over the database. In the first pass we form a set of candidate itemsets, C, that is guaranteed to be a superset of the set of frequent itemsets. During this pass, the counts of candidates in C are determined over each partition in exactly the same way as in Oracle, by maintaining the candidates in a DAG structure. The 1-itemsets and 2-itemsets are stored in lookup arrays as in Oracle. But unlike in Oracle, candidates are inserted into and removed from G at the end of each partition. Generation and removal of candidates is done simultaneously while computing counts. The details of candidate generation and removal during the first pass are described in [8] due to lack of space. For ease of exposition, we assume in the remainder of this section that all candidates (including 1-itemsets and 2-itemsets) are stored in the DAG.

Along with each candidate X, we also store the following three integers, as in the CARMA algorithm [5]: (1) X.count, the number of occurrences of X since X was inserted in G; (2) X.firstPartition, the index of the first partition at which X was inserted in G; and (3) X.maxMissed, an upper bound on the number of occurrences of X before X was inserted in G. firstPartition and maxMissed are computed when X is inserted into G in a manner identical to CARMA.

While the CARMA algorithm works on a tuple-by-tuple basis, we have adapted the semantics of these fields to suit the partitioning approach. If the database scanned so far is d (refer Table 1), then the support of any candidate X in G will lie in the range [X.count / |d|, (X.count + X.maxMissed) / |d|] [5]. These bounds are denoted by minSupport(X) and maxSupport(X), respectively. We define an itemset X to be d-frequent if minSupport(X) >= minsup. Unlike in the CARMA algorithm, where only d-frequent itemsets are stored at any stage, the DAG structure in ARMOR contains other candidates, including the negative border of the d-frequent itemsets, to ensure efficient candidate generation. The details are given in [8].
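The support bounds above can be written down directly (a small sketch with our own function names; d_size stands for the number of tuples scanned so far, written |d| in the text):

```python
def min_support(count, d_size):
    """Lower bound: occurrences actually observed since insertion."""
    return count / d_size

def max_support(count, max_missed, d_size):
    """Upper bound: observed occurrences plus those possibly missed
    before the candidate was inserted."""
    return (count + max_missed) / d_size

def is_d_frequent(count, d_size, minsup):
    """A candidate is d-frequent when even its lower bound clears minsup."""
    return min_support(count, d_size) >= minsup
```

A candidate whose max_support falls below minsup can never become frequent and is safe to discard, which is the basis of the pruning at the end of the first pass.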
At the end of the first pass, the candidate set C is pruned to include only the d-frequent itemsets and their negative border. The counts of itemsets in C over the entire database are determined during the second pass. The counting process is again identical to that of Oracle. No new candidates are generated during the second pass. However, candidates may be removed. The details of candidate removal in the second pass are deferred to [8].

The pseudo-code of ARMOR is shown in Figure 8 and is explained below.

4.1 First Pass

At the beginning of the first pass, the set of candidate itemsets C is initialized to the set of singleton itemsets (line 2). The ReadNextPartition function (line 4) reads tuples from the next partition and simultaneously creates tid-lists of the singleton itemsets in G.

After reading in the entire partition, the Update1 function (details in [8]) is applied on each singleton in G (lines 5-7). It increments the counts of existing candidates by their corresponding counts in the current partition. It is also responsible for the generation and removal of candidates.

At the end of the first pass, C contains a superset of the set of frequent itemsets. For a candidate in C that has been inserted at partition Pj, its count over the partitions Pj, ..., Pn will be available.

4.2 Second Pass

At the beginning of the second pass, candidates in C that are neither d-frequent nor part of the current negative border are removed from C (line 8). For candidates that were inserted in C at the first partition, their counts over the entire database will be available. These itemsets, with their counts, are output (line 9). The OutputFinished function also performs the following task: if it outputs an itemset X and X has no supersets left in C, X is removed from C.

During the second pass, the ReadNextPartition function (line 13) reads tuples from the next partition and creates tid-lists of the singleton itemsets in G. After reading in the entire partition, the Update2 function (details in [8]) is applied on each singleton in G (lines 14-15). Finally, before reading in the next partition, we check to see if there are any more candidates. If not, the mining process terminates.

5 Memory Utilization in ARMOR

In the design and implementation of ARMOR, we have opted for speed in most decisions that involve a space-speed tradeoff. Therefore, the main memory utilization in ARMOR is certainly more as compared to algorithms such as Apriori. However, in the following discussion, we show that the memory usage of ARMOR is well within the reaches of current machine configurations. This is also experimentally confirmed in the next section.

The main memory consumption of ARMOR comes from the following sources: (1) the 1-d and 2-d arrays for storing the counters of singletons and pairs, respectively; (2) the DAG structure for storing the counters (and tidlists) of longer itemsets; and (3) the current partition.

The total number of entries in the 1-d and 2-d arrays and in the DAG structure corresponds to the number of candidates in ARMOR, which, as we have discussed in [8], is only marginally more than |F ∪ N|. For the moment, if we disregard the space occupied by the tidlists of itemsets, then the amortized amount of space taken by each candidate is a small constant (about 10 integers for the DAG and 1 integer for the arrays). E.g., if there are 1 million candidates in the DAG and 10 million in the arrays, the space required is about 80 MB. Since the environment we consider is one where the pattern lengths are small, the number of candidates will typically be well within the available main memory. [12] discusses alternative approaches when this assumption does not hold.

Regarding the space occupied by the tidlists of itemsets, note that ARMOR only needs to store the tidlists of d-frequent itemsets. The number of d-frequent itemsets is of the same order as the number of frequent itemsets, |F|. The total space occupied by tidlists while processing a partition is then bounded by |F| times the partition size, in integers. E.g., if |F| = 5K and the partition size is 20K tuples, then the space occupied by tidlists is bounded by about 400 MB. We assume |F| to be in the range of a few thousands at most, because otherwise the total number of rules generated would be enormous and the purpose of mining would not be served. Note that the above bound is very pessimistic. Typically, the lengths of tidlists are much smaller than the partition size, especially as the itemset length increases.

Main memory consumed by the current partition is small compared to the above two factors. E.g., if each transaction occupies 1 KB, a partition of size 20K tuples would require only 20 MB of memory. Even in these extreme examples, the total memory consumption of ARMOR is about 500 MB, which is acceptable on current machines.

Therefore, in general we do not expect memory to be an issue for mining market-basket databases using ARMOR. Further, even if it does happen to be an issue, it is easy to modify ARMOR to free the space allocated to tidlists at the expense of time: a tidlist can be freed after line 3 in the Update function shown in Figure 4.

A final observation is that the main memory consumption of ARMOR is proportional to the size of the output and does not "explode" as the input problem size increases.

Table 3. Worst-case Efficiency of ARMOR w.r.t. Oracle
  Database (|D| = 10M)   minsup (%)   ARMOR (seconds)   Oracle (seconds)   ARMOR / Oracle
  T10.I4                 0.1          371.44            226.99             1.63
  T20.I12                0.4          1153.42           814.01             1.41
  T40.I8                 1.15         2703.64           2267.26            1.19

6 Experimental Results for ARMOR

We evaluated the performance of ARMOR with respect to Oracle on a variety of databases and support characteristics. We now report on our experimental results for the same performance model described in Section 3.
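The back-of-the-envelope memory estimates of Section 5 (about 80 MB for candidate counters and a pessimistic 400 MB bound for tidlists) can be reproduced as follows, assuming 4-byte integers (our own sketch; the function names are illustrative):

```python
INT_BYTES = 4

def counter_mb(dag_candidates, array_candidates):
    """Counter space: ~10 integers per DAG node, 1 integer per array entry."""
    return (dag_candidates * 10 + array_candidates * 1) * INT_BYTES / 1e6

def tidlist_bound_mb(num_frequent, partition_size):
    """Pessimistic tidlist bound: |F| tidlists of one integer per partition tuple."""
    return num_frequent * partition_size * INT_BYTES / 1e6
```

With the 20 MB partition buffer added, these two terms reproduce the roughly 500 MB worst-case total quoted in the text.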
Since Apriori, 160 FP-growth and VIPER have already been compared against 1K items Oracle in Section 3.1, we do not repeat those observations 140 20K items here, but focus on the performance of ARMOR. This lends 120 Memory Used (MB) to the visual clarity of the graphs. We hasten to add that ARMOR does outperform the other algorithms. 100 80 6.1 Experiment 3: Performance of ARMOR 60 In this experiment, we evaluated the response time per- 40 formance of the ARMOR and Oracle algorithms for the 20 T10.I4, T20.I12 and T40.I8 databases each containing 10M transactions and these results are shown in Figures 9a–c. 0 0 0.005 0.01 0.015 0.02 In these graphs, we first see that ARMOR’s performance Support (as a %) is close to that of Oracle for high supports. This is because of the following reasons: The density of the frequent item- Figure 10. Memory Utilization in ARMOR set distribution is sparse at high supports resulting in only a few frequent itemsets with supports “close” to  . Hence, frequent itemsets are likely to be locally frequent set the value of  to 20K items for the T10.I4 database – within most partitions. Even if they are not locally fre- this environment represents an extremely stressful situation quent in a few partitions, it is very likely that they are still for ARMOR with regard to memory utilization due to the -frequent over these partitions. Hence, their counters are very large number of items. Figure 10 shows the memory updated even over these partitions. Therefore, the complete utilization of ARMOR as a function of support for the  counts of most candidates would be available at the end of = 1K and  = 20K cases. We see that the main memory the first pass resulting in a “light and short” second pass. utilization of ARMOR scales well with the number of items. Hence, it is expected that the performance of ARMOR will For example, at the 0.1% support threshold, the memory be close to that of Oracle for high supports. 
consumption of ARMOR for  = 1K items was 104MB Since the frequent itemset distribution becomes dense while for  = 20K items, it was 143MB – an increase in less at low supports, the above argument does not hold in this than 38% for a 20 times increase in the number of items! support region. Hence we see that ARMOR’s performance The reason for this is that the main memory utilization of relative to Oracle decreases at low supports. But, what is ARMOR does not depend directly on the number of items, far more important is that ARMOR consistently performs but only on the size of the output,   , as discussed in within a factor of two of Oracle. This is highlighted in Ta- Section 5. ble 3 where we show the ratios of the performance of AR- MOR to that of Oracle for the lowest support values consid- 6.3 Experiment 5: Real Datasets ered for each of the databases. Despite repeated efforts, we were unable to obtain large 6.2 Experiment 4: Memory Utilization in AR- real datasets that conform to the sparse nature of market MOR basket data since such data is not publicly available due to proprietary reasons. The datasets in the UC Irvine public The previous experiments were conducted with the total domain repository [3] which are commonly used in data number of items,  , being set to 1K. In this experiment we mining studies were not suitable for our purpose since they (a) T10.I4.D10M (b) T20.I12.D10M (c) T40.I8.D10M 400 1200 3000 Oracle Oracle Oracle 350 ARMOR ARMOR ARMOR 1000 2500 300 Time (seconds) Time (seconds) Time (seconds) 800 2000 250 200 600 1500 150 400 1000 100 200 500 50 0 0 0 0 0.5 1 1.5 2 0 0.5 1 1.5 2 0 1 2 3 4 5 Support (as a %) Support (as a %) Support (as a %) Figure 9. Performance of ARMOR (Synthetic Datasets) (b) EachMovie are dense and have long patterns. 
We could, however, obtain two datasets – BMS-WebView-1, a clickstream dataset from Blue Martini Software [14], and EachMovie, a movie database from Compaq Equipment Corporation [1] – which we transformed to the format of boolean market basket data. The resulting databases had 59,602 and 61,202 transactions, respectively, with 870 and 1648 distinct items.

We set the rule support threshold values for the BMS-WebView-1 and EachMovie databases to the ranges (0.06%–0.1%) and (3%–10%), respectively. The results of these experiments are shown in Figures 11a–b.

[Figure 11. Performance of ARMOR (Real Datasets): response time (seconds) versus support (%) for Oracle and ARMOR on (a) BMS-WebView-1 and (b) EachMovie.]

We see in these graphs that the performance of ARMOR continues to be within twice that of Oracle. The ratio of ARMOR's performance to that of Oracle at the lowest support value of 0.06% for the BMS-WebView-1 database was 1.83, whereas at the lowest support value of 3% for the EachMovie database it was 1.73.

6.4 Discussion of Experimental Results

We now explain the reasons as to why ARMOR should typically perform within a factor of two of Oracle. First, we notice that the only difference between the single pass of Oracle and the first pass of ARMOR is that ARMOR continuously generates and removes candidates. Since the generation and removal of candidates in ARMOR is dynamic and efficient, this does not result in a significant additional cost for ARMOR.

Since candidates in ARMOR that are neither d-frequent nor part of the current negative border are continuously removed, any itemset that is locally frequent within a partition, but not globally frequent in the entire database, is likely to be removed from the candidate set during the course of the first pass (unless it belongs to the current negative border).
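The pruning criterion described above can be sketched as follows. This is our own simplified rendering of the rule – keep a candidate only if it is frequent over the transactions seen so far, or lies on the current negative border (all of its proper subsets frequent) – using a plain dictionary of counts; the paper's actual data structures (a DAG over candidates) are not reproduced, and the treatment of singletons is a simplifying assumption.

```python
from itertools import combinations

def prune_candidates(counts, txns_seen, minsup_frac):
    """Retain a candidate itemset only if it is frequent over the
    transactions seen so far, or is on the current negative border,
    i.e. all of its proper subsets are currently frequent.
    Singletons are always retained here for simplicity."""
    thresh = minsup_frac * txns_seen
    frequent = {i for i, c in counts.items() if c >= thresh}
    kept = {}
    for iset, c in counts.items():
        if iset in frequent or len(iset) == 1:
            kept[iset] = c
        elif all(sub in frequent
                 for sub in combinations(iset, len(iset) - 1)):
            kept[iset] = c  # negative border: may yet become frequent
    return kept
```

Applying this rule continuously is what keeps the candidate set close to the required mining output: an itemset frequent in one partition but infrequent overall fails both tests soon after its supporting partition has been scanned, and is dropped.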
Hence the resulting candidate set in ARMOR is a good approximation of the required mining output. In fact, in our experiments, we found that in the worst case, the number of candidates counted in ARMOR was only about ten percent more than the required mining output.

The above two reasons indicate that the cost of the first pass of ARMOR is only slightly more than that of (the single pass in) Oracle.

Next, we notice that the only difference between the second pass of ARMOR and (the single pass in) Oracle is that in ARMOR, candidates are continuously removed. Hence the number of itemsets being counted in ARMOR during the second pass quickly reduces to much less than that of Oracle. Moreover, ARMOR does not necessarily perform a complete scan over the database during the second pass, since the second pass ends when there are no more candidates. For these reasons, we would expect the cost of the second pass of ARMOR to usually be less than that of (the single pass in) Oracle.

Since the cost of the first pass of ARMOR is usually only slightly more than that of (the single pass in) Oracle, and that of the second pass is usually less than that of (the single pass in) Oracle, it follows that ARMOR will typically perform within a factor of two of Oracle.

In summary, due to the above reasons, it appears unlikely that algorithms can be designed that substantially reduce either the number of database passes or the number of candidates counted – these represent the primary bottlenecks in association rule mining. Further, since ARMOR utilizes the same itemset counting technique as Oracle, further overall improvement without domain knowledge seems extremely difficult. Finally, even though we have not proved the optimality of Oracle with respect to tidlist intersection, we note that any smart intersection techniques that may be implemented in Oracle can also be used in ARMOR.

7 Conclusions

A variety of novel algorithms have been proposed in the recent past for the efficient mining of association rules, each in turn claiming to outperform its predecessors on a set of standard databases. In this paper, our approach was to quantify the algorithmic performance of association rule mining algorithms with regard to an idealized, but practically infeasible, "Oracle". The Oracle algorithm utilizes a partitioning strategy to determine the supports of itemsets in the required output. It uses direct lookup arrays for counting singletons and pairs and a DAG data structure for counting longer itemsets. We have shown that these choices are optimal in that only required itemsets are enumerated and that the cost of enumerating each itemset is Θ(1). Our experimental results showed that there was a substantial gap between the performance of current mining algorithms and that of the Oracle.

We also presented a new online mining algorithm called ARMOR (Association Rule Mining based on ORacle), which was constructed with minimal changes to Oracle to result in an online algorithm. ARMOR utilizes a new method of candidate generation that is dynamic and incremental, and is guaranteed to complete in two passes over the database. Our experimental results demonstrate that ARMOR performs within a factor of two of Oracle.

Acknowledgments This work was partially supported by a Swarnajayanti Fellowship from the Dept. of Science and Technology, Govt. of India.

References

[1] EachMovie collaborative filtering data set. http://www.research.compaq.com/SRC/eachmovie/, 1997.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1994.
[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000.
[5] C. Hidber. Online association rule mining. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, June 1999.
[6] J. Lin and M. H. Dunham. Mining association rules: Anti-skew algorithms. In Proc. of Intl. Conf. on Data Engineering (ICDE), 1998.
[7] V. Pudi and J. Haritsa. Quantifying the utility of the past in mining large databases. Information Systems, July 2000.
[8] V. Pudi and J. Haritsa. On the optimality of association-rule mining algorithms. Technical Report TR-2001-01, DSL, Indian Institute of Science, 2001.
[9] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of Intl. Conf. on Very Large Databases (VLDB), 1995.
[10] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000.
[11] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1995.
[12] Y. Xiao and M. H. Dunham. Considering main memory in mining association rules. In Proc. of Intl. Conf. on Data Warehousing and Knowledge Discovery (DAWAK), 1999.
[13] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. Technical Report 01-1, Rensselaer Polytechnic Institute, 2001.
[14] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2001.