APRIORI, A Depth First Implementation

Walter A. Kosters
Leiden Institute of Advanced Computer Science
Universiteit Leiden
P.O. Box 9512, 2300 RA Leiden
The Netherlands
kosters@liacs.nl

Wim Pijls
Department of Computer Science
Erasmus University
P.O. Box 1738, 3000 DR Rotterdam
The Netherlands
pijls@few.eur.nl

Abstract

We will discuss DF, the depth first implementation of APRIORI as devised in 1999 (see [8]). Given a database, this algorithm builds a trie in memory that contains all frequent itemsets, i.e., all sets that are contained in at least minsup transactions from the original database. Here minsup is a threshold value given in advance. In the trie, which is constructed by adding one item at a time, every path corresponds to a unique frequent itemset. We describe the algorithm in detail, derive theoretical formulas, and provide experiments.
1 Introduction

In this paper we discuss the depth first (DF, see [8]) implementation of APRIORI (see [1]), one of the fastest known data mining algorithms to find all frequent itemsets in a large database, i.e., all sets that are contained in at least minsup transactions from the original database. Here minsup is a threshold value given in advance. There exist many implementations of APRIORI (see, e.g., [6, 11]). We would like to focus on algorithms that assume that the whole database fits in main memory, this often being the state of affairs; among these, DF and FP (the FP-growth implementation of APRIORI, see [5]) are the fastest. In most papers so far little attention has been given to theoretical complexity. In [3, 7] a theoretical basis for the analysis of these two algorithms was presented.

The depth first algorithm DF is simple and proceeds as follows. After some preprocessing, which involves reading the database and sorting the single items with respect to their support, DF builds a trie in memory, where every path from the root downwards corresponds to a unique frequent itemset; in consecutive steps items are added to this trie one at a time. Both the database and the trie are kept in main memory, which might cause memory problems: both are usually very large, and in particular the trie gets much larger as the support threshold decreases. Finally the algorithm outputs all paths in the trie, i.e., all frequent itemsets. Note that once completed, the trie allows for fast itemset retrieval in the case of online processing.

We formerly had two implementations of the algorithm, one being time efficient, the other being memory efficient (called dftime.cc and dfmemory.cc, respectively), where the time efficient version could not handle low support thresholds. The newest version (called dffast.cc) combines them into one even faster implementation, and runs for all support thresholds.

In this paper we first describe DF; we then give some formal definitions and theoretical formulas, discuss the program, provide experimental results, and conclude with some remarks.
2 The Algorithm

An appropriate data structure to store the frequent itemsets of a given database is a trie. As a running example in this section we use the dataset of Figure 1. Each line represents a transaction. The trie of frequent patterns is shown in Figure 2. The entries (or cells) in a node of a trie are usually called buckets, as is also the case for a hash-tree. Each bucket can be identified with its path to the root and hence with a unique frequent itemset. The example trie has 9 nodes and 18 buckets, representing 18 frequent itemsets. As an example, the frequent itemset ABEF can be seen as the leftmost path in the trie; a set such as AC is not present.

Dataset:

  transaction number   items
  1                    BCD
  2                    ABEF
  3                    ABEF
  4                    ABCF
  5                    ABCEF
  6                    CDEF

Frequent itemsets when minsup = 3:

  support   frequent itemsets
  5         B, F
  4         A, AB, AF, ABF, BF, C, E, EF
  3         AE, ABE, ABEF, AEF, BC, BE, BEF, CF

Figure 1. An example of a dataset along with its frequent itemsets.

Figure 2. An example of a trie (without support counts).

One of the oldest algorithms for finding frequent patterns is APRIORI, see [1]. This algorithm successively finds all frequent 1-itemsets, all frequent 2-itemsets, all frequent 3-itemsets, and so on. (A k-itemset has k items.) The frequent k-itemsets are used to generate candidate (k+1)-itemsets, where the candidates are only known to have two frequent subsets with k elements. After a pruning step, where candidates still having infrequent subsets are discarded, the support of the candidates is determined. The way APRIORI finds the frequent patterns implies that the trie is built layer by layer. First the nodes in the root (depth 0) are constructed, next the trie nodes at depth 1 are constructed, and so on. So, APRIORI can be thought of as an algorithm that builds the pattern trie in a breadth first way.

We propose an algorithm that builds the trie in a depth first way. We will explain the depth first construction of the trie using the dataset of Figure 1. Note that the trie grows from right to left. The algorithm proceeds as follows. In a preprocessing step, the support of each single item is counted and the infrequent items are eliminated. Let the m frequent items be denoted by I_1, I_2, ..., I_m. Next, the code from Figure 3 is executed.

  (1)  T := the trie including only bucket I_m;
  (2)  for k := m-1 downto 1 do
  (3)    T' := T;
  (4)    T := T' with I_k added to the left and
           a copy of T' appended to I_k;
  (5)    T'' := the subtrie rooted in I_k;
  (6)    count(I_k, T'');
  (7)    delete the infrequent itemsets from T'';

  (9)  procedure count(I_k, T''):
  (10)   for every transaction t including item I_k do
  (11)     for every itemset X in T'' do
  (12)       if t supports X then X.support++;

Figure 3. The algorithm.

The procedure count(I_k, T'') determines the support of each itemset (bucket) in the subtrie T''. This is achieved by a database pass, in which each transaction including item I_k is considered.
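To make the construction concrete, the following Python sketch (ours — the actual implementation is the C++ program dffast.cc; all names are hypothetical) follows Figure 3 literally on the running example of Figure 1, representing the trie as a nested dictionary:

```python
import copy

MINSUP = 3
TRANSACTIONS = ["BCD", "ABEF", "ABEF", "ABCF", "ABCEF", "CDEF"]
ITEMS = "ABCEF"  # the frequent single items I_1, ..., I_m (D is infrequent)

def count(transactions, root_item, node):
    # one database pass: push every transaction containing I_k through T''
    for t in transactions:
        if root_item in t:
            node["sup"] += 1
            push(t, node["children"])

def push(t, children):
    # a transaction only traverses a subtrie if it contains its root item
    for item, child in children.items():
        if item in t:
            child["sup"] += 1
            push(t, child["children"])

def prune(node, minsup):
    # line (7): delete the infrequent itemsets from T''
    node["children"] = {i: c for i, c in node["children"].items()
                        if c["sup"] >= minsup}
    for c in node["children"].values():
        prune(c, minsup)

def reset(children):
    for c in children.values():
        c["sup"] = 0
        reset(c["children"])

def df(transactions, items, minsup):
    last = items[-1]
    trie = {last: {"sup": sum(last in t for t in transactions),
                   "children": {}}}
    for item in reversed(items[:-1]):     # for k := m-1 downto 1
        subtrie = copy.deepcopy(trie)     # a copy of T' appended to I_k
        reset(subtrie)
        node = {"sup": 0, "children": subtrie}
        count(transactions, item, node)   # count(I_k, T'')
        prune(node, minsup)
        trie = {item: node, **trie}       # I_k added to the left
    return trie

def paths(children, prefix=""):
    # every path in the final trie is a frequent itemset
    for item, child in children.items():
        yield prefix + item, child["sup"]
        yield from paths(child["children"], prefix + item)

frequent = dict(paths(df(TRANSACTIONS, ITEMS, MINSUP)))
```

The 18 itemsets found, with their supports, match the table of Figure 1.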
Any such transaction is one at a time "pushed" through T'', where it only traverses a subtrie if it includes the root of this subtrie, meanwhile updating the support fields in the buckets. In the last paragraph of Section 4 a refinement of this part of the algorithm is presented. On termination of the algorithm, T exactly contains the frequent itemsets.

Figure 4 illustrates the consecutive steps of the algorithm applied to our example. The single items surpassing the minimum support threshold 3 are A, B, C, E and F. In the figure, the shape of T after each iteration of the for loop is shown. Also the infrequent itemsets to be deleted at the end of an iteration are mentioned. At the start of the iteration with index k, the root of trie T consists of the 1-itemsets I_{k+1}, ..., I_m. (We denote a 1-itemset by the name of its only item, omitting curly braces and commas, as in Figure 1 and Figure 4.) By the statement in line (3) from Figure 3, this trie may also be referred to as T'. A new trie T is composed by adding bucket I_k to the root and by appending a copy of T' (the former value of T) to I_k. The newly added buckets are the new candidates and they make up the subtrie T''. In Figure 4, the candidate set is in the left part of each trie and is drawn in bold.

The number of iterations in the for loop is one less than the number of frequent 1-itemsets. Consequently, the number of database passes is also one less than the number of frequent 1-itemsets. This causes the algorithm to be tractable only if the database under consideration is memory-resident.
Given the present-day memory sizes, this is not a real constraint any more.

As stated above, our algorithm has a preprocessing step which counts the support of each single item. After this preprocessing step, the items may be re-ordered. The most favorable execution time is achieved if we order the items by increasing frequency (see Section 3 for a more formal motivation). It is better to have low support at the top of the deeper side (to the left bottom) of the trie and hence high support at the top of the shallow part (to the upper right) of the trie.
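The re-ordering step just described might be sketched as follows (a Python sketch of ours, not the paper's code; the alphabetical tie-break is our choice):

```python
from collections import Counter

def order_items(transactions, minsup):
    """Select the frequent single items and sort them by increasing
    support; ties are broken alphabetically (an arbitrary choice)."""
    supp = Counter(item for t in transactions for item in set(t))
    frequent = [i for i in supp if supp[i] >= minsup]
    return sorted(frequent, key=lambda i: (supp[i], i))

# the running example of Figure 1: D (support 2) is eliminated,
# and the least frequent items come first
order = order_items(["BCD", "ABEF", "ABEF", "ABCF", "ABCEF", "CDEF"], 3)
```

Note that Figures 2 and 4 use the alphabetical order A, B, C, E, F instead; the algorithm is correct for any order of the frequent items, the ordering only influences the amount of work.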
We may distinguish between "dense" datasets and "sparse" datasets. A dense dataset has many frequent patterns of large size and high support, as is the case for test sets such as chess and mushroom (see Section 5). In those datasets, many transactions are similar to each other. Datasets with mainly short patterns are called sparse. Longer patterns may exist, but with relatively small support. Real-world transaction databases of supermarkets mostly belong to this category. Also the synthetic datasets from Section 5 have similar properties: interesting support thresholds are much lower than in the dense case.

Algorithms for finding frequent patterns may be divided into two types: algorithms with and without candidate generation. Any APRIORI-like instance belongs to the first type. Eclat (see [9]) may also be considered as an instance of this type. The FP-growth algorithm FP from [5] is the best-known instance of the second type (though one can also defend the point of view that it does generate candidates). For dense datasets, FP performs better than candidate generating algorithms: FP stores the dataset in a way that is very efficient especially when the dataset has many similar transactions. In the case of algorithms that do apply candidate generation, dense sets produce a large number of candidates; since each new candidate has to be related to each transaction, the database passes take a lot of time. However, for sparse datasets, candidate generation is a very suitable method for finding frequent patterns. In our experience, the instances of the APRIORI family are very useful when searching transaction databases. According to the results in [7] the depth first algorithm DF outperforms FP-growth on the synthetic transaction sets (see Section 5 for a description of these sets).

Finally, note that technically speaking DF is not a full implementation of APRIORI, since every candidate itemset is known to have only one frequent subset (resulting from the part of the trie which has already been completed) instead of two. Apart from this, its underlying candidate generation mechanism strongly resembles the one from APRIORI.

Figure 4. Illustrating the algorithm. (Shown is the shape of T after each iteration: after adding C, the itemsets CE and CEF are infrequent and hence deleted; after adding B, BCF is infrequent and hence deleted; after adding A, ABC, AC and ACF are infrequent and hence deleted. The final trie, after deleting the infrequent itemsets, is identical to Figure 2.)

3 Theoretical Complexity

Let n denote the number of transactions (also called customers), and let m denote the number of products (also called items). Usually n is much larger than m.
For a non-empty itemset X ⊆ {1, 2, ..., m} we define:

- supp(X) is the support of X: the number of customers that buy all products from X (and possibly more), or equivalently the number of transactions that contain X;
- min(X) is the smallest number in X;
- max(X) is the largest number in X.

In line with this we let min(∅) = m + 1. We also put max(∅) = 0 and supp(∅) = n. A set X ⊆ {1, 2, ..., m} is called frequent if supp(X) ≥ minsup, where the so-called support threshold minsup is a fixed number given in advance; let F denote the set of frequent itemsets. We assume every 1-itemset to be frequent; this can be effected by the first step of the algorithms we are looking at, which might be considered as preprocessing.

A "database query" is defined as a question of the form "Does customer c buy product i?" (or "Does transaction t have item i?"), posed to the original database. Note that we have n · m database queries in the "preprocessing" phase, in which the supports of the 1-itemsets are computed and ordered: every field of the database is inspected once. (By the way, the sorting, in which the items are assigned the numbers 1, 2, ..., m, takes O(m log m) time.) The number of database queries for DF equals:

    \sum_{|X| \ge 2,\; X \setminus \{\min(X)\} \in F} \mathrm{supp}(X \setminus \{\max(X)\})        (1)
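To make the definitions above concrete, the following sketch (ours; the frequent items of the running example A, B, C, E, F are renumbered 1, ..., 5) computes supp and enumerates the frequent sets by brute force:

```python
from itertools import combinations

def supp(transactions, x):
    """supp(X): the number of transactions containing X; supp({}) == n."""
    return sum(1 for t in transactions if x <= t)

# Figure 1 with items renumbered: A=1, B=2, C=3, E=4, F=5 (D is infrequent)
TS = [{2, 3}, {1, 2, 4, 5}, {1, 2, 4, 5}, {1, 2, 3, 5},
      {1, 2, 3, 4, 5}, {3, 4, 5}]
MINSUP = 3

# brute-force enumeration over all non-empty subsets of {1, ..., 5}
freq = [set(c) for r in range(1, 6)
        for c in combinations(range(1, 6), r)
        if supp(TS, set(c)) >= MINSUP]
```

The 18 sets found agree with Figure 1; e.g., supp({1, 2}) = supp(AB) = 4, and supp(∅) = n = 6.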
For a proof of Equation (1), see [3]. It relies on the fact that, in order for a bucket to occur in the trie, the path to it (except for the root) should be frequent, and on the observation that this particular bucket is "questioned" every time a transaction follows this same path. In [3] a simple version of FP is described in a similar style, leading to

    \sum_{|X| \ge 2,\; X \in F} \mathrm{supp}(X \setminus \{\min(X)\})        (2)

database queries in "local databases" (FP-trees), except for the preprocessing phase. Note the extra condition on the summation (which is "good" for FP: we have less summands there), while on the other hand the summands are larger (which is "good" for DF: we have a smaller contribution there).

It also makes sense to look at the total number of nodes of the trie during its construction, which is connected to the effort of maintaining and using the data structures. Counting each trie-node with the number of buckets it contains, the total is computed to be:

    \sum_{X \in F \cup \{\emptyset\}} (\min(X) - 1)        (3)

When the trie is finally ready, the number of remaining buckets equals the number of frequent sets, each item in a node being the end of the path that represents the corresponding itemset.

Notice that the complexity heavily depends on the sorting order of the items at the top level. It turns out that an increasing order of items is beneficial here. This is suggested by the contribution of the 1-itemsets in Equation (1):

    \sum_{k=1}^{m} (m - k) \cdot \mathrm{supp}(\{I_k\})        (4)

which happens to be minimal in that case. This 1-itemset contribution turns out to be the same for both DF and FP: see [3, 7], where also results for FP are presented in more detail.

4 Implementation Issues

In this section we discuss some implementation details of our program. As mentioned in Section 2, the database is traversed many times. It is therefore necessary that the database is memory-resident. Fortunately, only the occurrences of frequent items need to be stored. The database is represented by a two-dimensional boolean array. For efficiency reasons, one array entry corresponds to one bit. Since the function count in the algorithm considers the database transaction by transaction, a horizontal layout is chosen, cf. [4].
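A minimal sketch of such a horizontal bit layout (ours — here one Python integer serves as the row of bits for one transaction; in the C++ program this is a packed two-dimensional array):

```python
def encode(transactions, frequent_items):
    """Horizontal layout: one row per transaction, one bit per frequent item."""
    bit = {item: b for b, item in enumerate(frequent_items)}
    return [sum(1 << bit[i] for i in t if i in bit) for t in transactions]

def mask(itemset, frequent_items):
    bit = {item: b for b, item in enumerate(frequent_items)}
    return sum(1 << bit[i] for i in itemset)

def supp(rows, m):
    # a transaction supports an itemset iff its row contains all the bits
    return sum(1 for r in rows if r & m == m)

ROWS = encode(["BCD", "ABEF", "ABEF", "ABCF", "ABCEF", "CDEF"], "ABCEF")
```

On the running example this gives, e.g., supp(AB) = 4 and supp(ABEF) = 3, in agreement with Figure 1; the item D is simply not encoded.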
We have four preprocessing steps before the algorithm of Section 2 actually starts.

1. The range of the item values is determined. This is necessary, because some test sets, e.g., the BMS-WebView sets, have item values in quite a different range.

2. This is an essential initial step. First, for each item the support is counted. Next, the frequent items are selected and sorted by frequency. This process is relevant, since the frequency order also prescribes the order in the root of the trie, as stated before. The sorted frequent items along with their supports are retained in an array.

3. If a transaction has zero or one frequent item, it need not be stored into the memory-resident representation of the database: the root of the trie is constructed according to the information gathered in step 2, and for constructing the other buckets only transactions with at least two frequent items are relevant. In this step, we count the relevant transactions.

4. During this step the database is stored into a two-dimensional array with horizontal layout. Each item is given a new number, according to its rank in the frequency order. The length of the array equals the result of step 3; the width is determined by the number of frequent items.

After this preparatory work, which in practice usually takes a few seconds, the code as described in Section 2 is executed. The cells of the root are constructed using the result of initial step 2.

In line (12) from Figure 3 in Section 2, backtracking is applied to inspect each path of T''. Inspecting a path is aborted as soon as an item outside the current transaction is found. Obviously, processing one transaction during the count procedure is a relatively expensive task, which is unfortunately inevitable, whichever version of APRIORI is used.

As mentioned in the introduction, we used to have two implementations, one being time efficient, the other being memory efficient. These two have been used in the overall FIMI'03 comparisons. The newest implementation (called dffast.cc) combines these versions by using the following refinement. Instead of appending a copy of T' to I_k (see Figure 3 in Section 2), first the counting is done in auxiliary fields in the original T', after which only the frequent buckets are copied underneath I_k. This makes the deletion of infrequent itemsets (line (7) from Figure 3) unnecessary and leads to better memory management. Another improvement might be achieved by using more auxiliary fields while adding two root items simultaneously in each iteration, thereby halving the number of database passes at the cost of more bookkeeping.
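The refinement can be sketched as follows (a simplified Python version of ours, not the actual dffast.cc code): candidate supports are first accumulated in auxiliary fields of the existing trie, and only the frequent buckets are then copied under the new root item, so no deletion step is needed.

```python
MINSUP = 3
TRANSACTIONS = ["BCD", "ABEF", "ABEF", "ABCF", "ABCEF", "CDEF"]
ITEMS = "ABCEF"  # frequent single items, in the root order of the example

def aux_count(t, children):
    # walk the existing trie T', updating auxiliary counters only
    for item, node in children.items():
        if item in t:
            node["aux"] += 1
            aux_count(t, node["children"])

def copy_frequent(children, minsup):
    # copy only the buckets whose auxiliary count reached minsup,
    # resetting every auxiliary field on the way
    out = {}
    for item, node in children.items():
        sup, node["aux"] = node["aux"], 0
        sub = copy_frequent(node["children"], minsup)
        if sup >= minsup:
            out[item] = {"sup": sup, "aux": 0, "children": sub}
    return out

def build(transactions, items, minsup):
    trie = {}
    for item in reversed(items):
        for t in transactions:
            if item in t:
                aux_count(t, trie)        # one database pass, no copy yet
        node = {"sup": sum(item in t for t in transactions), "aux": 0,
                "children": copy_frequent(trie, minsup)}
        trie = {item: node, **trie}
    return trie

def itemsets(children, prefix=""):
    for item, node in children.items():
        yield prefix + item, node["sup"]
        yield from itemsets(node["children"], prefix + item)

FREQUENT = dict(itemsets(build(TRANSACTIONS, ITEMS, MINSUP)))
```

Infrequent candidates such as BCF are counted but never copied, so they never occupy trie memory.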
5 Experiments

Using the relatively small database chess (342 kB, with 3,196 transactions; available from the FIMI'03 website at http://fimi.cs.helsinki.fi/testdata.html), the database mushroom (570 kB, with 8,124 transactions; also available from the FIMI'03 website) and the well-known IBM-Almaden synthetic databases (see [2]), we shall examine the complexity of the algorithm. These databases have either few, but coherent, records (chess and mushroom), or many records (the synthetic databases). The parameters for generating a synthetic database are the number of transactions D (in thousands), the average transaction size T and the average length I of the so-called maximal potentially large itemsets. The number of items was set to 1,000, following the design in [2]. We use T10I4D100K (4.0 MB) and T40I10D100K (15.5 MB), both also available from the FIMI'03 website mentioned above; they both contain 100,000 transactions.

The experiments were conducted on a Pentium-IV machine with 512 MB memory at 2.8 GHz, running Red Hat Linux 7.3. The program was developed under the GNU C++ compiler, version 2.96.

The following statistics are plotted in the graphs: the execution time in seconds of the algorithm (see Section 4), and the total number of frequent itemsets; in all figures the corresponding axis is on the right hand side and scales 0–5,500,000 (0–8,000,000 for T10I4D100K). The execution time excludes preprocessing: in this phase the database is read three times in order to detect the frequent items (see before); also excluded is the time needed to print the resulting itemsets. These actions together usually only take a few seconds.

Figure 5. Experimental results for database chess (execution time of DF in seconds against relative support (%); number of frequent sets, in 1,000,000s, on the right axis).

Figure 6. Experimental results for database mushroom (execution time of DF in seconds against relative support (%); number of frequent sets, in 1,000,000s, on the right axis).
The number of frequent 1-itemsets (m from the previous sections, where we assumed all 1-itemsets to be frequent) has range 31–39 for the experiments on the database chess, 54–76 for mushroom, 844–869 for T10I4D100K and 610–862 for T40I10D100K. Note the very high support thresholds for mushroom (at least 5%) and chess (at least 44%); for T10I4D100K a support threshold as low as 0.003% was even feasible.

The largest output files produced are of size 110.6 MB (chess, minsup = 1,400, having 3,771,728 frequent sets with 13 frequent 17-itemsets), 121.5 MB (mushroom, minsup = 400, having 3,457,747 frequent sets with 24 frequent 17-itemsets), 131.1 MB (T10I4D100K, minsup = 3, having 6,169,854 frequent sets with 30 frequent 13-itemsets and 1 frequent 14-itemset) and 195.9 MB (T40I10D100K, minsup = 300, having 5,058,313 frequent sets, with 21 frequent 19-itemsets and 1 frequent 20-itemset). The final trie in the T40I10D100K case occupies approximately 65 MB of memory, the output file in this case being 3 times as large.

Note that the 3,457,747 sets for the chess database with minsup = 1,400 require 829 seconds to find, whereas the 3,771,728 frequent itemsets for mushroom with minsup = 400 take 158 seconds, differing approximately a factor 5 in time. This difference in runtime is probably caused by the difference in the absolute minsup value. Each cell corresponding to a frequent itemset is visited at least 1,400 times in the former case against 400 times in the latter case. A similar phenomenon is observed when comparing T40I10D100K with absolute minsup value 300 and T10I4D100K with minsup = 3: this takes 378 versus 88 seconds. Although the outputs have the same orders of magnitude, the runtimes differ substantially. We see that, besides the number of frequent itemsets and the sizes of these sets, the absolute minsup value is a major factor determining the runtime.

Figure 7. Experimental results for database T10I4D100K (execution time of DF in seconds against relative support (%); number of frequent sets, in 1,000,000s, on the right axis).

Figure 8. Experimental results for database T40I10D100K (execution time of DF in seconds against relative support (%); number of frequent sets, in 1,000,000s, on the right axis).

6 Conclusions

In this paper, we addressed DF, a depth first implementation of APRIORI. In our experience, DF competes with any other well-known algorithm, especially when applied to large transaction databases.

Storing the database in primary memory is no longer a problem. On the other hand, storing the candidates causes trouble in situations where a dense database is considered with a small support threshold. This is the case for any algorithm using candidates.
Therefore, it would be desirable to look for a method which stores candidates in secondary memory. This is an obvious topic for future research. To our knowledge, FP is the only algorithm that can cope with memory limitations. However, for real world retail databases this algorithm is surpassed by DF, as we showed in [7]. Other optimizations might also be possible. Besides improving the C++ code, ideas from, e.g., [10] on diffsets with vertical layouts might be used.

Our conclusion is that DF is a simple, practical, straightforward and fast algorithm for finding all frequent itemsets.

References

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J.B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings 20th International Conference on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.

[3] J.M. de Graaf, W.A. Kosters, W. Pijls, and V. Popova. A theoretical and practical comparison of depth first and FP-growth implementations of Apriori. In H. Blockeel and M. Denecker, editors, Proceedings of the Fourteenth Belgium-Netherlands Artificial Intelligence Conference (BNAIC 2002), pages 115–122, 2002.

[4] B. Goethals. Survey on frequent pattern mining. Manuscript, Helsinki, 2003.

[5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), pages 1–12, 2000.

[6] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Mining association rules: Deriving a superior algorithm by analyzing today's approaches. In D.A. Zighed, J. Komorowski, and J. Żytkow, editors, Principles of Data Mining and Knowledge Discovery, Proceedings of the 4th European Conference (PKDD 2000), Springer Lecture Notes in Computer Science 1910, pages 159–168. Springer Verlag, 2000.

[7] W.A. Kosters, W. Pijls, and V. Popova. Complexity analysis of depth first and FP-growth implementations of Apriori. In P. Perner and A. Rosenfeld, editors, Machine Learning and Data Mining in Pattern Recognition, Proceedings MLDM 2003, Springer Lecture Notes in Artificial Intelligence 2734, pages 284–292. Springer Verlag, 2003.

[8] W. Pijls and J.C. Bioch. Mining frequent itemsets in memory-resident databases. In E. Postma and M. Gyssens, editors, Proceedings of the Eleventh Belgium-Netherlands Conference on Artificial Intelligence (BNAIC 1999), pages 75–82, 1999.

[9] M.J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12:372–390, 2000.

[10] M.J. Zaki and K. Gouda. Fast vertical mining using diffsets. In Proceedings 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), pages 401–406, 2001.