ARMOR: Association Rule Mining based on ORacle

Vikram Pudi, Intl. Inst. of Information Technology, Hyderabad, India (vikram@iiit.net)
Jayant R. Haritsa, Database Systems Lab, SERC, Indian Institute of Science, Bangalore, India (haritsa@dsl.serc.iisc.ernet.in)

Abstract

In this paper, we first focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our strategy is to compare their performance against an "Oracle algorithm" that knows in advance the identities of all frequent itemsets in the database and only needs to gather their actual supports to complete the mining process. Our experimental results show that current mining algorithms do not perform uniformly well with respect to the Oracle for all database characteristics and support thresholds. In many cases there is a substantial gap between the Oracle's performance and that of the current mining algorithms. Second, we present a new mining algorithm, called ARMOR, that is constructed by making minimal changes to the Oracle algorithm. ARMOR consistently performs within a factor of two of the Oracle on both real and synthetic datasets over practical ranges of support specifications.

1 Introduction

We focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our approach is to compare their performance against an "Oracle algorithm" that knows in advance the identities of all frequent itemsets in the database and only needs to gather the actual supports of these itemsets to complete the mining process. Clearly, any practical algorithm will have to do at least this much work in order to generate mining rules. This "Oracle approach" permits us to clearly demarcate the maximal space available for performance improvement over the currently available algorithms. Further, it enables us to construct new mining algorithms from a completely different perspective, namely, as minimally-altered derivatives of the Oracle.

First, we show that while the notion of the Oracle is conceptually simple, its construction is not equally straightforward. In particular, it is critically dependent on the choice of data structures used during the counting process. We present a carefully engineered implementation of Oracle that makes the best choices for these design parameters at each stage of the counting process. Our experimental results show that there is a considerable gap in the performance between the Oracle and existing mining algorithms.

Second, we present a new mining algorithm, called ARMOR (Association Rule Mining based on ORacle), whose structure is derived by making minimal changes to the Oracle, and which is guaranteed to complete in two passes over the database. Although ARMOR is derived from the Oracle, it may be seen to share the positive features of a variety of previous algorithms such as PARTITION [9], CARMA [5], AS-CPA [6], VIPER [10] and DELTA [7]. Our empirical study shows that ARMOR consistently performs within a factor of two of the Oracle, over both real (BMS-WebView-1 [14] from Blue Martini Software) and synthetic databases (from the IBM Almaden generator [2]), over practical ranges of support specifications.

The environment we consider is one where the pattern lengths in the database are small enough that the size of the mining results is comparable to the available main memory. This holds when the mined data conforms to the sparse nature of market basket data, for which association rule mining was originally intended. It is perhaps inappropriate to apply the problem of mining all frequent itemsets to dense datasets.

For ease of exposition, we will use the notation shown in Table 1 in the remainder of this paper.

Table 1. Notation
  D        Database of customer purchase transactions
  minsup   User-specified minimum rule support
  F        Set of frequent itemsets in D
  N        Set of itemsets in the negative border of F
  P1..Pn   Set of disjoint partitions of D
  d        Transactions in partitions scanned so far during algorithm execution, excluding the current partition
  d+       Transactions in partitions scanned so far during algorithm execution, including the current partition
  G        DAG structure to store candidates

Organization. The remainder of this paper is organized as follows: The design of the Oracle algorithm is described in Section 2 and is used to evaluate the performance of current algorithms in Section 3. Our new ARMOR algorithm is presented in Section 4 and its main memory requirements are discussed in Section 5. The performance of ARMOR is evaluated in Section 6. Finally, in Section 7, we summarize the conclusions of our study.

2 The Oracle Algorithm

In this section we present the Oracle algorithm which, as mentioned in the Introduction, "magically" knows in advance the identities of all frequent itemsets in the database and only needs to gather the actual supports of these itemsets. Clearly, any practical algorithm will have to do at least this much work in order to generate mining rules. Oracle takes as input the database D in item-list format (which is organized as a set of rows, with each row storing an ordered list of item-identifiers (IIDs) representing the items purchased in the transaction), the set of frequent itemsets F, and its corresponding negative border N, and outputs the supports of these itemsets by making one scan over the database. We first describe the mechanics of the Oracle algorithm below and then move on to discuss the rationale behind its design choices in Section 2.2.

2.1 The Mechanics of Oracle

For ease of exposition, we first present the manner in which Oracle computes the supports of 1-itemsets and 2-itemsets, and then move on to longer itemsets. Note, however, that the algorithm actually performs all these computations concurrently in one scan over the database.

2.1.1 Counting Singletons and Pairs

Data-Structure Description. The counters of singletons (1-itemsets) are maintained in a one-dimensional lookup array, A1, and those of pairs (2-itemsets) in a lower triangular two-dimensional lookup array, A2 (similar arrays are also used in Apriori [2, 11] for its first two passes). The i-th entry in the array A1 contains two fields: (1) count, the counter for the itemset X corresponding to the i-th item, and (2) index, the number of frequent itemsets prior to X in A1 if X is frequent; null, otherwise.

Algorithm Description. The ArrayCount function shown in Figure 1 takes as inputs a transaction T along with A1 and A2, and updates the counters of these arrays over T. In the ArrayCount function, the individual items in the transaction are enumerated (lines 2-5) and, for each item, its corresponding count in A1 is incremented (line 3). During this process, the frequent items in T are stored in a separate itemset Tf (line 5). We then enumerate all pairs of items contained in Tf (lines 6-10) and increment the counters of the corresponding 2-itemsets in A2 (lines 8-10).

ArrayCount (T, A1, A2)
Input: Transaction T, 1-itemsets Array A1, 2-itemsets Array A2
Output: Arrays A1 and A2 with their counts updated over T
1.  Itemset Tf = null;  // to store frequent items from T in Item-List format
2.  for each item i in transaction T
3.      A1[i].count++;
4.      if A1[i].index != null
5.          append i to Tf
6.  for j = 1 to |Tf|  // enumerate 2-itemsets
7.      for k = j + 1 to |Tf|
8.          index1 = A1[Tf[j]].index  // row index
9.          index2 = A1[Tf[k]].index  // column index
10.         A2[index1][index2]++;

Figure 1. Counting Singletons and Pairs in Oracle

2.1.2 Counting k-itemsets, k > 2

Data-Structure Description. Itemsets in F ∪ N of length greater than 2, and their related information (counters, etc.), are stored in a DAG structure G, which is pictorially shown in Figure 2 for a database with items A, B, C, D. Although singletons and pairs are stored in lookup arrays, as mentioned before, for expository ease we assume that they too are stored in G in the remainder of this discussion. Each itemset is stored in a separate node of G and is linked to the first two (in a lexicographic ordering) of its subsets. We use the terms "mother" and "father" of an itemset to refer to the (lexicographically) smaller subset and the (lexicographically) larger subset, respectively. E.g., {A, B} and {A, C} are the mother and father, respectively, of {A, B, C}. For each itemset X in G, we also store with it links to those supersets of X for which X is a mother. We call this list of links the childset of X.

Since each itemset is stored in a separate node in the DAG, we use the terms "itemset" and "node" interchangeably in the remainder of this discussion. Also, we use C to denote the set of itemsets that are stored in the DAG structure G.

[Figure 2. DAG Structure Containing the Power Set of {A, B, C, D}: each node is linked to its mother (e.g., ABC to AB) and to its father (e.g., ABC to AC); the mother links form a spanning tree of the DAG.]

Algorithm Description. We use a partitioning [9] scheme wherein the database is logically divided into n disjoint horizontal partitions P1, P2, ..., Pn. In this scheme, itemsets being counted are enumerated only at the end of each partition, and not after every tuple. Each partition is as large as can fit in available main memory. For ease of exposition, we assume that the partitions are equi-sized.

Oracle (D, C)
Input: Database D, Itemsets to be Counted C = F ∪ N
Output: Itemsets in C with Supports
1.  n = Number of Partitions
2.  for i = 1 to n
3.      ReadNextPartition(Pi, G);
4.      for each singleton X in G
5.          Update(X);

Figure 3. The Oracle Algorithm

Update (M)
Input: DAG Node M
Output: M and its Descendants with Counts Updated
1.  tv = convert M.tidlist to Tid-vector format  // tv is statically allocated
2.  for each node X in M.childset
3.      X.tidlist = Intersect(tv, X.father.tidlist);
4.      X.count += |X.tidlist|
5.  for each node X in M.childset
6.      Update(X);

Figure 4. Updating Counts
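To make the ArrayCount mechanics concrete, here is a small illustrative sketch (our own Python rendering, not the paper's C++ implementation; the names make_counters, a1, a2 and the dictionary-based entries are ours): singletons are counted in a flat array whose entries carry the count and index fields described above, and pairs of frequent items are counted in a lower triangular array addressed by their index values.

```python
def make_counters(num_items, frequent_items):
    """Build A1 and A2. a1[i] holds count and index, where index is the
    number of frequent items prior to i (i.e., its rank), or None."""
    freq_index = {item: rank for rank, item in enumerate(sorted(frequent_items))}
    a1 = [{"count": 0, "index": freq_index.get(i)} for i in range(num_items)]
    nf = len(freq_index)
    # Lower-triangular pair-counter array: one slot per pair of frequent items.
    a2 = [[0] * nf for _ in range(nf)]
    return a1, a2

def array_count(transaction, a1, a2):
    """Update singleton and pair counters over one transaction (cf. Figure 1)."""
    tf = []  # frequent items seen in this transaction
    for item in transaction:
        a1[item]["count"] += 1
        if a1[item]["index"] is not None:
            tf.append(item)
    for j in range(len(tf)):
        for k in range(j + 1, len(tf)):
            r = a1[tf[j]]["index"]  # row index
            c = a1[tf[k]]["index"]  # column index
            a2[max(r, c)][min(r, c)] += 1  # store in the lower triangle
```

Counting pairs only among the frequent items of a transaction is what keeps the triangular array small relative to the full item universe.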
However, we hasten to add that the technique is easily extendible to arbitrary partition sizes.

The pseudo-code of Oracle is shown in Figure 3 and operates as follows: The ReadNextPartition function (line 3) reads tuples from the next partition and simultaneously creates tid-lists (within that partition) of the singleton itemsets in G; a tid-list of an itemset is an ordered list of the TIDs of the transactions that contain it. Note that this conversion of the database to the tid-list (TL) format is an on-the-fly operation and does not change the complexity of Oracle by more than a (small) constant factor. Next, the Update function (line 5) is applied on each singleton in G. This function takes a node M in G as input and updates the counts of all descendants of M to reflect their counts over the current partition. The count of any itemset within a partition is equal to the length of its corresponding tidlist (within that partition). The tidlist of an itemset can be obtained as the intersection of the tidlists of its mother and father, and this process is started off using the tidlists of the frequent 1-itemsets. The exact details of tidlist computation are discussed later.

We now describe the manner in which the itemsets in G are enumerated after reading in a new partition. The set of childset links induces a spanning tree of G (e.g., consider only the solid edges in Figure 2). We perform a depth first search on this spanning tree to enumerate all its itemsets. When a node in the tree is visited, we compute the tidlists of all its children. This ensures that when an itemset is visited, the tidlists of its mother and father have already been computed.

The above processing is captured in the function Update, whose pseudo-code is shown in Figure 4. Here, the tidlist of a given node M is first converted to the tid-vector (TV) format (line 1); a tid-vector of an itemset is a bit-vector of 1's and 0's representing the presence or absence, respectively, of the itemset in each transaction. Then, the tidlists of all children of M are computed (lines 2-4), after which the same children are visited in a depth first search (lines 5-6).

The mechanics of tidlist computation, as promised earlier, are given in Figure 5.

Intersect (tv, T)
Input: Tid-vector tv, Tid-list T
Output: Intersection of tv and T
1.  Tid-list result = empty
2.  for each tid in T
3.      offset = tid - (tid of first transaction in current partition)
4.      if tv[offset] = 1 then
5.          add tid to result
6.  return result

Figure 5. Intersection
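The counting machinery of Figures 3-5 can be sketched as follows (an illustrative Python rendering under our own naming; the paper's implementation is in C++ and statically allocates the tid-vector): each DAG node carries mother/father/childset links, intersect performs Figure 5's position-based bit test, and update computes the children's tidlists before recursing depth-first.

```python
class Node:
    """One itemset in the DAG G, with mother/father subset links."""
    def __init__(self, itemset, mother=None, father=None):
        self.itemset = itemset
        self.mother, self.father = mother, father
        self.childset = []  # supersets for which this node is the mother
        self.tidlist = []
        self.count = 0

def intersect(tv, tidlist, first_tid):
    """Figure 5: keep each tid whose bit is set in the tid-vector tv."""
    return [tid for tid in tidlist if tv[tid - first_tid]]

def update(node, partition_size, first_tid):
    """Figure 4: compute children's tidlists from the mother's tid-vector
    and the father's tidlist, accumulate counts, then recurse depth-first."""
    tv = [0] * partition_size  # tid-vector of `node` over this partition
    for tid in node.tidlist:
        tv[tid - first_tid] = 1
    for child in node.childset:
        child.tidlist = intersect(tv, child.father.tidlist, first_tid)
        child.count += len(child.tidlist)
    for child in node.childset:
        update(child, partition_size, first_tid)
```

Because children are expanded only after their mother is visited, every node's mother and father tidlists exist by the time the node itself is processed, exactly as the spanning-tree argument above requires.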
The Intersect function shown in Figure 5 takes as input a tid-vector tv and a tid-list T. Each tid in T is added to the result if tv[offset] is 1 (lines 2-5), where offset, computed in line 3, represents the position of the transaction relative to the current partition.

2.2 Rationale for the Oracle Design

We show that Oracle is optimal in two respects: (1) it enumerates only those itemsets in G that need to be enumerated, and (2) the enumeration is performed in the most efficient way possible. These results are based on the following two theorems. Due to lack of space, we have deferred the proofs of the theorems to [8].

Theorem 2.1. If the size of each partition is large enough that every itemset in F ∪ N of length greater than 2 is present at least once in it, then the only itemsets being enumerated in the Oracle algorithm are those whose counts need to be incremented in that partition.

Theorem 2.2. The cost of enumerating each itemset in Oracle is Θ(1), with a tight constant factor.

While Oracle is optimal in these respects, we note that there may remain some scope for improvement in the details of tidlist computation. That is, the Intersect function (Figure 5), which computes the intersection of a tid-vector tv and a tid-list T, requires |T| operations. tv itself was originally constructed from a tid-list, although this cost is amortized over many calls to the Intersect function. We plan to investigate in our future work whether the intersection of two sets can, in general, be computed more efficiently - for example, using diffsets, a novel and interesting approach suggested in [13]. The diffset of an itemset X is the set-difference of the tid-list of X from that of its mother. Diffsets can be easily incorporated in Oracle - only the Update function in Figure 4 of Section 2 needs to be changed to compute diffsets instead of tidlists, following the techniques suggested in [13].

Advantages of Partitioning Schemes. Oracle, as discussed above, uses a partitioning scheme. An alternative commonly used in current association rule mining algorithms, especially in hashtree [2] based schemes, is the tuple-by-tuple approach. A problem with the tuple-by-tuple approach, however, is that there is considerable wasted enumeration of itemsets. The core operation in these algorithms is to determine all candidates that are subsets of the current transaction. Given that a frequent itemset X is present in the current transaction, we need to determine all candidates that are immediate supersets of X and are also present in the current transaction. In order to achieve this, it is often necessary to enumerate and check for the presence of many more candidates than those that are actually present in the current transaction.

3 Performance Study

In the previous section, we described the Oracle algorithm. In order to assess the performance of current mining algorithms with respect to the Oracle algorithm, we have chosen VIPER [10] and FP-growth [4], among the latest in the suite of online mining algorithms. For completeness and as a reference point, we have also included the classical Apriori in our evaluation suite.

Our experiments cover a range of database and mining workloads, and include the typical and extreme cases considered in previous studies - the only difference is that we also consider database sizes that are significantly larger than the available main memory. The performance metric in all the experiments is the total execution time taken by the mining operation.

The databases used in our experiments were synthetically generated using the technique described in [2] and attempt to mimic the customer purchase behavior seen in retailing environments. The parameters used in the synthetic generator and their default values are described in Table 2. In particular, we consider databases with parameters T10.I4, T20.I12 and T40.I8, with 10 million tuples in each of them.

Table 2. Parameter Table
  Parameter   Meaning                                        Default Values
  N           Number of items                                1000
  |T|         Mean transaction length                        10, 20, 40
  |L|         Number of potentially frequent itemsets        2000
  |I|         Mean length of potentially frequent itemsets   4, 8, 12
  |D|         Number of transactions in the database         10M

We set the rule support threshold values to as low as was feasible with the available main memory. At these low support values, the number of frequent itemsets exceeded twenty-five thousand! Beyond this, we felt that the number of rules generated would be enormous and the purpose of mining - to find interesting patterns - would not be served. In particular, we set the rule support threshold values for the T10.I4, T20.I12 and T40.I8 databases to the ranges (0.1%-2%), (0.4%-2%) and (1.15%-5%), respectively.

Our experiments were conducted on a 700-MHz Pentium III workstation running Red Hat Linux 6.2, configured with 512 MB main memory and a local 18 GB SCSI 10000 rpm disk.
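To illustrate the diffset idea (a minimal sketch of the approach from [13], with our own function names, not the authors' code): the diffset of an itemset is formed from its mother's tidlist, and the itemset's support then follows by subtraction rather than by materializing the full tidlist.

```python
def diffset(mother_tidlist, tidlist):
    """Tids present in the mother's tidlist but missing from the itemset's own."""
    present = set(tidlist)
    return [tid for tid in mother_tidlist if tid not in present]

def support_from_diffset(mother_support, dset):
    """support(X) = support(mother(X)) - |diffset(X)|."""
    return mother_support - len(dset)
```

Diffsets pay off when an itemset's tidlist is nearly as long as its mother's, since the stored difference is then short.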
For the T10.I4, T20.I12 and T40.I8 databases, the associated database sizes were approximately 500 MB, 900 MB and 1.7 GB, respectively. All the algorithms in our evaluation suite are written in C++. We implemented a basic version of the FP-growth algorithm (the original implementation by Han et al. was not available) wherein we assume that the entire FP-tree data structure fits in main memory. Finally, the partition size in Oracle was fixed to be 20K tuples.

3.1 Experimental Results for Current Mining Algorithms

We now report on our experimental results. We conducted two experiments to evaluate the performance of current mining algorithms with respect to the Oracle. Our first experiment was run on large (10M tuples) databases, while our second experiment was run on small (100K tuples) databases.

3.1.1 Experiment 1: Performance of Current Algorithms

In our first experiment, we evaluated the performance of the Apriori, VIPER and Oracle algorithms for the T10.I4, T20.I12 and T40.I8 databases, each containing 10M transactions; these results are shown in Figures 6a-c. The x-axis in these graphs represents the support threshold values, while the y-axis represents the response times of the algorithms being evaluated.

In these graphs, we see that the response times of all algorithms increase exponentially as the support threshold is reduced. This is only to be expected, since the number of itemsets in the output, |F ∪ N|, increases exponentially with decrease in the support threshold.

We also see that there is a considerable gap in the performance of both Apriori and VIPER with respect to Oracle. For example, in Figure 6a, at a support threshold of 0.1%, the response time of VIPER is more than 6 times that of Oracle, whereas the response time of Apriori is more than 26 times!

In this experiment, we could not evaluate the performance of FP-growth because it did not complete in any of our runs on large databases, due to its heavy and database-size-dependent utilization of main memory. The reason for this is that FP-growth stores the database itself in a condensed representation in a data structure called the FP-tree. In [4], the authors briefly discuss the issue of constructing disk-resident FP-trees. We, however, did not take this into account in our implementation.

3.1.2 Experiment 2: Small Databases

Since, as mentioned above, it was not possible for us to evaluate the performance of FP-growth on large databases due to its heavy utilization of main memory, we evaluated the performance of FP-growth and the other current algorithms on small databases consisting of 100K transactions. The results of this experiment are shown in Figures 7a-c, which correspond to the T10.I4, T20.I12 and T40.I8 databases, respectively.

In these graphs, we see that there continues to be a considerable gap in the performance of current mining algorithms with respect to Oracle. For example, for the T40.I8 database, the response time of FP-growth is more than 8 times that of Oracle for the entire support threshold range.

4 The ARMOR Algorithm

In the previous section, our experimental results have shown that there is a considerable gap in the performance between the Oracle and existing mining algorithms. We now move on to describe our new mining algorithm, ARMOR (Association Rule Mining based on ORacle). In this section, we overview the main features and the flow of execution of ARMOR - the details of candidate generation are deferred to [8] due to lack of space.

ARMOR (D, I, minsup)
Input: Database D, Set of Items I, Minimum Support minsup
Output: F ∪ N with Supports
1.  n = Number of Partitions
    // ----- First Pass -----
2.  C = I  // candidate set (in a DAG)
3.  for i = 1 to n
4.      ReadNextPartition(Pi, G);
5.      for each singleton X in G
6.          X.count += |X.tidlist|
7.          Update1(X, minsup);
    // ----- Second Pass -----
8.  RemoveSmall(C, minsup);
9.  OutputFinished(C, minsup);
10. for i = 1 to n
11.     if (all candidates in C have been output)
12.         exit
13.     ReadNextPartition(Pi, G);
14.     for each singleton X in G
15.         Update2(X, minsup);

Figure 8. The ARMOR Algorithm
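The two-pass control flow of Figure 8 can be illustrated with a toy Python sketch. This is a drastic simplification under our own assumptions: candidates are single items only, candidate generation happens at partition boundaries, and the Update1/Update2 and RemoveSmall internals (detailed in [8]) are reduced to plain counting and a count + maxMissed pruning test.

```python
def armor_toy(partitions, minsup_count):
    """Two-pass partitioned counting: pass 1 builds candidates with partial
    counts, pass 2 completes counts over the partitions each candidate missed."""
    cand = {}  # item -> {"count", "first_partition", "max_missed"}
    # --- First pass: count existing candidates; insert new ones at the
    # partition boundary (as in ARMOR, not after every tuple).
    for i, part in enumerate(partitions):
        seen = set()
        for txn in part:
            for item in txn:
                if item in cand:
                    cand[item]["count"] += 1
                else:
                    seen.add(item)
        missed = sum(len(p) for p in partitions[: i + 1])  # tuples scanned so far
        for item in seen:
            cand[item] = {"count": 0, "first_partition": i + 1, "max_missed": missed}
    # RemoveSmall: drop candidates that cannot reach minsup even in the best case.
    cand = {x: f for x, f in cand.items()
            if f["count"] + f["max_missed"] >= minsup_count}
    # --- Second pass: re-count each candidate over the partitions it missed.
    for i, part in enumerate(partitions):
        if all(f["first_partition"] <= i for f in cand.values()):
            break  # no candidate missed this or any later partition: stop early
        for txn in part:
            for item in txn:
                f = cand.get(item)
                if f is not None and f["first_partition"] > i:
                    f["count"] += 1
    return {x: f["count"] for x, f in cand.items() if f["count"] >= minsup_count}
```

The early exit in the second pass mirrors lines 11-12 of Figure 8: once every remaining candidate's missing prefix of the database has been re-read, scanning stops.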
The guiding principle in our design of the ARMOR algorithm is that we consciously make an attempt to determine the minimal amount of change to Oracle required to result in an online algorithm. This is in marked contrast to the earlier approaches, which designed new algorithms by trying to address the limitations of previous online algorithms. That is, we approach the association rule mining problem from a completely different perspective.

[Figure 6. Performance of Current Algorithms (Large Databases): response time (seconds) versus support (%) for Oracle, Apriori and VIPER on (a) T10.I4.D10M, (b) T20.I12.D10M, (c) T40.I8.D10M.]

[Figure 7. Performance of Current Algorithms (Small Databases): response time (seconds) versus support (%) for Oracle, Apriori, VIPER and FP-growth on (a) T10.I4.D100K, (b) T20.I12.D100K, (c) T40.I8.D100K.]

In ARMOR, as in Oracle, the database is conceptually partitioned into n disjoint blocks P1, P2, ..., Pn. At most two passes are made over the database. In the first pass we form a set of candidate itemsets, C, that is guaranteed to be a superset of the set of frequent itemsets. During this pass, the counts of candidates in C are determined over each partition in exactly the same way as in Oracle, by maintaining the candidates in a DAG structure. The 1-itemsets and 2-itemsets are stored in lookup arrays as in Oracle. But unlike in Oracle, candidates are inserted into and removed from G at the end of each partition. Generation and removal of candidates is done simultaneously while computing counts. The details of candidate generation and removal during the first pass are described in [8] due to lack of space. For ease of exposition, we assume in the remainder of this section that all candidates (including 1-itemsets and 2-itemsets) are stored in the DAG.

Along with each candidate X, we also store the following three integers, as in the CARMA algorithm [5]: (1) X.count, the number of occurrences of X since X was inserted in G; (2) X.firstPartition, the index of the first partition at which X was inserted in G; and (3) X.maxMissed, an upper bound on the number of occurrences of X before X was inserted in G. firstPartition and maxMissed are computed when X is inserted into G in a manner identical to CARMA.

While the CARMA algorithm works on a tuple-by-tuple basis, we have adapted the semantics of these fields to suit the partitioning approach. If the database scanned so far is d (refer Table 1), then the support of any candidate X in G will lie in the range [X.count / |d|, (X.count + X.maxMissed) / |d|] [5]. These bounds are denoted by minSupport(X) and maxSupport(X), respectively. We define an itemset X to be d-frequent if minSupport(X) >= minsup. Unlike in the CARMA algorithm, where only d-frequent itemsets are stored at any stage, the DAG structure in ARMOR contains other candidates, including the negative border of the d-frequent itemsets, to ensure efficient candidate generation. The details are given in [8].
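The support bounds above can be written down directly (a small sketch with our own function names; d_size stands for the number of tuples scanned so far, written |d| in the text):

```python
def min_support(count, d_size):
    """Lower bound: occurrences actually observed since insertion."""
    return count / d_size

def max_support(count, max_missed, d_size):
    """Upper bound: observed occurrences plus those possibly missed
    before the candidate was inserted."""
    return (count + max_missed) / d_size

def is_d_frequent(count, d_size, minsup):
    """A candidate is d-frequent when even its lower bound clears minsup."""
    return min_support(count, d_size) >= minsup
```

A candidate whose max_support falls below minsup can never become frequent and is safe to discard, which is the basis of the pruning at the end of the first pass.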
At the end of the first pass, the candidate set C is pruned to include only the d-frequent itemsets and their negative border. The counts of itemsets in C over the entire database are determined during the second pass. The counting process is again identical to that of Oracle. No new candidates are generated during the second pass. However, candidates may be removed. The details of candidate removal in the second pass are deferred to [8].

The pseudo-code of ARMOR is shown in Figure 8 and is explained below.

4.1 First Pass

At the beginning of the first pass, the set of candidate itemsets C is initialized to the set of singleton itemsets (line 2). The ReadNextPartition function (line 4) reads tuples from the next partition and simultaneously creates tid-lists of the singleton itemsets in G.

After reading in the entire partition, the Update1 function (details in [8]) is applied on each singleton in G (lines 5-7). It increments the counts of existing candidates by their corresponding counts in the current partition. It is also responsible for the generation and removal of candidates.

At the end of the first pass, C contains a superset of the set of frequent itemsets. For a candidate in C that has been inserted at partition Pj, its count over the partitions Pj, ..., Pn will be available.

4.2 Second Pass

At the beginning of the second pass, candidates in C that are neither d-frequent nor part of the current negative border are removed from C (line 8). For candidates that were inserted in C at the first partition, their counts over the entire database will be available. These itemsets, with their counts, are output (line 9). The OutputFinished function also performs the following task: if it outputs an itemset X and X has no supersets left in C, X is removed from C.

During the second pass, the ReadNextPartition function (line 13) reads tuples from the next partition and creates tid-lists of the singleton itemsets in G. After reading in the entire partition, the Update2 function (details in [8]) is applied on each singleton in G (lines 14-15). Finally, before reading in the next partition, we check to see if there are any more candidates. If not, the mining process terminates.

5 Memory Utilization in ARMOR

In the design and implementation of ARMOR, we have opted for speed in most decisions that involve a space-speed tradeoff. Therefore, the main memory utilization in ARMOR is certainly more as compared to algorithms such as Apriori. However, in the following discussion, we show that the memory usage of ARMOR is well within the reaches of current machine configurations. This is also experimentally confirmed in the next section.

The main memory consumption of ARMOR comes from the following sources: (1) the 1-d and 2-d arrays for storing the counters of singletons and pairs, respectively; (2) the DAG structure for storing the counters (and tidlists) of longer itemsets; and (3) the current partition.

The total number of entries in the 1-d and 2-d arrays and in the DAG structure corresponds to the number of candidates in ARMOR, which, as we have discussed in [8], is only marginally more than |F ∪ N|. For the moment, if we disregard the space occupied by the tidlists of itemsets, then the amortized amount of space taken by each candidate is a small constant (about 10 integers for the DAG and 1 integer for the arrays). E.g., if there are 1 million candidates in the DAG and 10 million in the arrays, the space required is about 80 MB. Since the environment we consider is one where the pattern lengths are small, the number of candidates will typically be well within the available main memory. [12] discusses alternative approaches when this assumption does not hold.

Regarding the space occupied by the tidlists of itemsets, note that ARMOR only needs to store the tidlists of d-frequent itemsets. The number of d-frequent itemsets is of the same order as the number of frequent itemsets, |F|. The total space occupied by tidlists while processing a partition is then bounded by |F| times the partition size, in integers. E.g., if |F| = 5K and the partition size is 20K tuples, then the space occupied by tidlists is bounded by about 400 MB. We assume |F| to be in the range of a few thousands at most, because otherwise the total number of rules generated would be enormous and the purpose of mining would not be served. Note that the above bound is very pessimistic. Typically, the lengths of tidlists are much smaller than the partition size, especially as the itemset length increases.

Main memory consumed by the current partition is small compared to the above two factors. E.g., if each transaction occupies 1 KB, a partition of size 20K tuples would require only 20 MB of memory. Even in these extreme examples, the total memory consumption of ARMOR is about 500 MB, which is acceptable on current machines.

Therefore, in general we do not expect memory to be an issue for mining market-basket databases using ARMOR. Further, even if it does happen to be an issue, it is easy to modify ARMOR to free the space allocated to tidlists at the expense of time: a tidlist can be freed after line 3 in the Update function shown in Figure 4.

A final observation is that the main memory consumption of ARMOR is proportional to the size of the output and does not "explode" as the input problem size increases.

Table 3. Worst-case Efficiency of ARMOR w.r.t. Oracle
  Database (|D| = 10M)   minsup (%)   ARMOR (seconds)   Oracle (seconds)   ARMOR / Oracle
  T10.I4                 0.1          371.44            226.99             1.63
  T20.I12                0.4          1153.42           814.01             1.41
  T40.I8                 1.15         2703.64           2267.26            1.19

6 Experimental Results for ARMOR

We evaluated the performance of ARMOR with respect to Oracle on a variety of databases and support characteristics. We now report on our experimental results for the same performance model described in Section 3.
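The back-of-the-envelope memory estimates of Section 5 (about 80 MB for candidate counters and a pessimistic 400 MB bound for tidlists) can be reproduced as follows, assuming 4-byte integers (our own sketch; the function names are illustrative):

```python
INT_BYTES = 4

def counter_mb(dag_candidates, array_candidates):
    """Counter space: ~10 integers per DAG node, 1 integer per array entry."""
    return (dag_candidates * 10 + array_candidates * 1) * INT_BYTES / 1e6

def tidlist_bound_mb(num_frequent, partition_size):
    """Pessimistic tidlist bound: |F| tidlists of one integer per partition tuple."""
    return num_frequent * partition_size * INT_BYTES / 1e6
```

With the 20 MB partition buffer added, these two terms reproduce the roughly 500 MB worst-case total quoted in the text.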
Since Apriori, 160 FP-growth and VIPER have already been compared against 1K items Oracle in Section 3.1, we do not repeat those observations 140 20K items here, but focus on the performance of ARMOR. This lends 120 Memory Used (MB) to the visual clarity of the graphs. We hasten to add that ARMOR does outperform the other algorithms. 100 80 6.1 Experiment 3: Performance of ARMOR 60 In this experiment, we evaluated the response time per- 40 formance of the ARMOR and Oracle algorithms for the 20 T10.I4, T20.I12 and T40.I8 databases each containing 10M transactions and these results are shown in Figures 9a–c. 0 0 0.005 0.01 0.015 0.02 In these graphs, we first see that ARMOR’s performance Support (as a %) is close to that of Oracle for high supports. This is because of the following reasons: The density of the frequent item- Figure 10. Memory Utilization in ARMOR set distribution is sparse at high supports resulting in only a few frequent itemsets with supports “close” to  . Hence, frequent itemsets are likely to be locally frequent set the value of  to 20K items for the T10.I4 database – within most partitions. Even if they are not locally fre- this environment represents an extremely stressful situation quent in a few partitions, it is very likely that they are still for ARMOR with regard to memory utilization due to the -frequent over these partitions. Hence, their counters are very large number of items. Figure 10 shows the memory updated even over these partitions. Therefore, the complete utilization of ARMOR as a function of support for the  counts of most candidates would be available at the end of = 1K and  = 20K cases. We see that the main memory the first pass resulting in a “light and short” second pass. utilization of ARMOR scales well with the number of items. Hence, it is expected that the performance of ARMOR will For example, at the 0.1% support threshold, the memory be close to that of Oracle for high supports. 
consumption of ARMOR for  = 1K items was 104MB Since the frequent itemset distribution becomes dense while for  = 20K items, it was 143MB – an increase in less at low supports, the above argument does not hold in this than 38% for a 20 times increase in the number of items! support region. Hence we see that ARMOR’s performance The reason for this is that the main memory utilization of relative to Oracle decreases at low supports. But, what is ARMOR does not depend directly on the number of items, far more important is that ARMOR consistently performs but only on the size of the output,   , as discussed in within a factor of two of Oracle. This is highlighted in Ta- Section 5. ble 3 where we show the ratios of the performance of AR- MOR to that of Oracle for the lowest support values consid- 6.3 Experiment 5: Real Datasets ered for each of the databases. Despite repeated efforts, we were unable to obtain large 6.2 Experiment 4: Memory Utilization in AR- real datasets that conform to the sparse nature of market MOR basket data since such data is not publicly available due to proprietary reasons. The datasets in the UC Irvine public The previous experiments were conducted with the total domain repository [3] which are commonly used in data number of items,  , being set to 1K. In this experiment we mining studies were not suitable for our purpose since they (a) T10.I4.D10M (b) T20.I12.D10M (c) T40.I8.D10M 400 1200 3000 Oracle Oracle Oracle 350 ARMOR ARMOR ARMOR 1000 2500 300 Time (seconds) Time (seconds) Time (seconds) 800 2000 250 200 600 1500 150 400 1000 100 200 500 50 0 0 0 0 0.5 1 1.5 2 0 0.5 1 1.5 2 0 1 2 3 4 5 Support (as a %) Support (as a %) Support (as a %) Figure 9. Performance of ARMOR (Synthetic Datasets) (b) EachMovie are dense and have long patterns. 
We could, however, obtain two datasets – BMS-WebView-1, a clickstream dataset from Blue Martini Software [14], and EachMovie, a movie database from Compaq Equipment Corporation [1] – which we transformed to the format of boolean market basket data. The resulting databases had 59,602 and 61,202 transactions, respectively, with 870 and 1648 distinct items.

We set the rule support threshold values for the BMS-WebView-1 and EachMovie databases to the ranges (0.06%–0.1%) and (3%–10%), respectively. The results of these experiments are shown in Figures 11a–b.

[Figure 11. Performance of ARMOR (Real Datasets): response time (seconds) versus support (%) for Oracle and ARMOR on (a) BMS-WebView-1 and (b) EachMovie.]

We see in these graphs that the performance of ARMOR continues to be within twice that of Oracle. The ratio of ARMOR's performance to that of Oracle at the lowest support value of 0.06% for the BMS-WebView-1 database was 1.83, whereas at the lowest support value of 3% for the EachMovie database it was 1.73.

6.4 Discussion of Experimental Results

We now explain the reasons as to why ARMOR should typically perform within a factor of two of Oracle. First, we notice that the only difference between the single pass of Oracle and the first pass of ARMOR is that ARMOR continuously generates and removes candidates. Since the generation and removal of candidates in ARMOR is dynamic and efficient, this does not result in a significant additional cost for ARMOR.

Since candidates in ARMOR that are neither d-frequent nor part of the current negative border are continuously removed, any itemset that is locally frequent within a partition, but not globally frequent in the entire database, is likely to be removed from the candidate set during the course of the first pass (unless it belongs to the current negative border).
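The pruning criterion described above can be sketched as follows. This is our own simplified rendering of the rule – keep a candidate only if it is frequent over the transactions seen so far, or lies on the current negative border (all of its proper subsets frequent) – using a plain dictionary of counts; the paper's actual data structures (a DAG over candidates) are not reproduced, and the treatment of singletons is a simplifying assumption.

```python
from itertools import combinations

def prune_candidates(counts, txns_seen, minsup_frac):
    """Retain a candidate itemset only if it is frequent over the
    transactions seen so far, or is on the current negative border,
    i.e. all of its proper subsets are currently frequent.
    Singletons are always retained here for simplicity."""
    thresh = minsup_frac * txns_seen
    frequent = {i for i, c in counts.items() if c >= thresh}
    kept = {}
    for iset, c in counts.items():
        if iset in frequent or len(iset) == 1:
            kept[iset] = c
        elif all(sub in frequent
                 for sub in combinations(iset, len(iset) - 1)):
            kept[iset] = c  # negative border: may yet become frequent
    return kept
```

Applying this rule continuously is what keeps the candidate set close to the required mining output: an itemset frequent in one partition but infrequent overall fails both tests soon after its supporting partition has been scanned, and is dropped.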
Hence the resulting candidate set in ARMOR is a good approximation of the required mining output. In fact, in our experiments, we found that in the worst case, the number of candidates counted in ARMOR was only about ten percent more than the required mining output.

The above two reasons indicate that the cost of the first pass of ARMOR is only slightly more than that of (the single pass in) Oracle.

Next, we notice that the only difference between the second pass of ARMOR and (the single pass in) Oracle is that in ARMOR, candidates are continuously removed. Hence the number of itemsets being counted in ARMOR during the second pass quickly reduces to much less than that of Oracle. Moreover, ARMOR does not necessarily perform a complete scan over the database during the second pass, since the second pass ends when there are no more candidates. For these reasons, we would expect the cost of the second pass of ARMOR to usually be less than that of (the single pass in) Oracle.

Since the cost of the first pass of ARMOR is usually only slightly more than that of (the single pass in) Oracle, and that of the second pass is usually less than that of (the single pass in) Oracle, it follows that ARMOR will typically perform within a factor of two of Oracle.

In summary, due to the above reasons, it appears unlikely that algorithms can be designed that substantially reduce either the number of database passes or the number of candidates counted – these represent the primary bottlenecks in association rule mining. Further, since ARMOR utilizes the same itemset counting technique as Oracle, further overall improvement without domain knowledge seems extremely difficult. Finally, even though we have not proved the optimality of Oracle with respect to tidlist intersection, we note that any smart intersection techniques that may be implemented in Oracle can also be used in ARMOR.

7 Conclusions

A variety of novel algorithms have been proposed in the recent past for the efficient mining of association rules, each in turn claiming to outperform its predecessors on a set of standard databases. In this paper, our approach was to quantify the algorithmic performance of association rule mining algorithms with regard to an idealized, but practically infeasible, "Oracle". The Oracle algorithm utilizes a partitioning strategy to determine the supports of itemsets in the required output. It uses direct lookup arrays for counting singletons and pairs and a DAG data structure for counting longer itemsets. We have shown that these choices are optimal in that only required itemsets are enumerated and that the cost of enumerating each itemset is Θ(1). Our experimental results showed that there was a substantial gap between the performance of current mining algorithms and that of the Oracle.

We also presented a new online mining algorithm called ARMOR (Association Rule Mining based on ORacle), which was constructed with minimal changes to Oracle to result in an online algorithm. ARMOR utilizes a new method of candidate generation that is dynamic and incremental, and is guaranteed to complete in two passes over the database. Our experimental results demonstrate that ARMOR performs within a factor of two of Oracle.

Acknowledgments This work was partially supported by a Swarnajayanti Fellowship from the Dept. of Science and Technology, Govt. of India.

References

[1] EachMovie collaborative filtering data set. http://www.research.compaq.com/SRC/eachmovie/, 1997.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1994.
[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000.
[5] C. Hidber. Online association rule mining. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, June 1999.
[6] J. Lin and M. H. Dunham. Mining association rules: Anti-skew algorithms. In Proc. of Intl. Conf. on Data Engineering (ICDE), 1998.
[7] V. Pudi and J. Haritsa. Quantifying the utility of the past in mining large databases. Information Systems, July 2000.
[8] V. Pudi and J. Haritsa. On the optimality of association-rule mining algorithms. Technical Report TR-2001-01, DSL, Indian Institute of Science, 2001.
[9] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of Intl. Conf. on Very Large Databases (VLDB), 1995.
[10] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000.
[11] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1995.
[12] Y. Xiao and M. H. Dunham. Considering main memory in mining association rules. In Proc. of Intl. Conf. on Data Warehousing and Knowledge Discovery (DAWAK), 1999.
[13] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. Technical Report 01-1, Rensselaer Polytechnic Institute, 2001.
[14] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2001.