=Paper=
{{Paper
|id=Vol-90/paper-12
|storemode=property
|title=Apriori, A Depth First Implementation
|pdfUrl=https://ceur-ws.org/Vol-90/kosters.pdf
|volume=Vol-90
|dblpUrl=https://dblp.org/rec/conf/fimi/KostersP03
}}
==Apriori, A Depth First Implementation==
APRIORI, A Depth First Implementation

Walter A. Kosters
Leiden Institute of Advanced Computer Science
Universiteit Leiden
P.O. Box 9512, 2300 RA Leiden, The Netherlands
kosters@liacs.nl

Wim Pijls
Department of Computer Science
Erasmus University
P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
pijls@few.eur.nl
Abstract

We will discuss DF, the depth first implementation of APRIORI as devised in 1999 (see [8]). Given a database, this algorithm builds a trie in memory that contains all frequent itemsets, i.e., all sets that are contained in at least minsup transactions from the original database. Here minsup is a threshold value given in advance. In the trie, which is constructed by adding one item at a time, every path corresponds to a unique frequent itemset. We describe the algorithm in detail, derive theoretical formulas, and provide experiments.

1 Introduction

In this paper we discuss the depth first (DF, see [8]) implementation of APRIORI (see [1]), one of the fastest known data mining algorithms for finding all frequent itemsets in a large database, i.e., all sets that are contained in at least minsup transactions from the original database. Here minsup is a threshold value given in advance. There exist many implementations of APRIORI (see, e.g., [6, 11]). We would like to focus on algorithms that assume that the whole database fits in main memory, this often being the state of affairs; among these, DF and FP (the FP-growth implementation of APRIORI, see [5]) are the fastest. In most papers so far little attention has been given to theoretical complexity. In [3, 7] a theoretical basis for the analysis of these two algorithms was presented.

The depth first algorithm is a simple algorithm that proceeds as follows. After some preprocessing, which involves reading the database and sorting the single items with respect to their support, DF builds a trie in memory, where every path from the root downwards corresponds to a unique frequent itemset; in consecutive steps items are added to this trie one at a time. Both the database and the trie are kept in main memory, which might cause memory problems: both are usually very large, and in particular the trie gets much larger as the support threshold decreases. Finally the algorithm outputs all paths in the trie, i.e., all frequent itemsets. Note that once completed, the trie allows for fast itemset retrieval in the case of online processing.

We formerly had two implementations of the algorithm, one being time efficient, the other being memory efficient (called dftime.cc and dfmemory.cc, respectively), where the time efficient version could not handle low support thresholds. The newest version (called dffast.cc) combines them into one even faster implementation, and runs for all support thresholds.

In this paper we first describe DF, we then give some formal definitions and theoretical formulas, we discuss the program, provide experimental results, and conclude with some remarks.

2 The Algorithm

An appropriate data structure to store the frequent itemsets of a given database is a trie. As a running example in this section we use the dataset of Figure 1. Each line represents a transaction. The trie of frequent patterns is shown in Figure 2. The entries (or cells) in a node of a trie are usually called buckets, as is also the case for a hash-tree. Each bucket can be identified with its path to the root and hence with a unique frequent itemset. The example trie has 9 nodes and 18 buckets, representing 18 frequent itemsets. As an example, the frequent itemset ABEF can be seen as the leftmost path in the trie, whereas a set such as CE is not present.

Dataset:

  transaction number | items
  1 | BCD
  2 | ABEF
  3 | ABEF
  4 | ABCF
  5 | ABCEF
  6 | CDEF

Frequent itemsets when minsup = 3:

  support | frequent itemsets
  5 | B, F
  4 | A, AB, AF, ABF, BF, C, E, EF
  3 | AE, ABE, ABEF, AEF, BC, BE, BEF, CF

Figure 1. An example of a dataset along with its frequent itemsets.

One of the oldest algorithms for finding frequent patterns is APRIORI, see [1]. This algorithm successively finds all frequent 1-itemsets, all frequent 2-itemsets, all frequent 3-itemsets, and so on. (A k-itemset has k items.) The frequent (k-1)-itemsets are used to generate candidate k-itemsets, where the candidates are only known to have two frequent subsets with k-1 elements. After a pruning step, where candidates still having infrequent subsets are discarded, the support of the candidates is determined. The way APRIORI finds the frequent patterns implies that the trie is built layer by layer. First the nodes in the root (depth 0) are constructed, next the trie nodes at depth 1 are constructed, and so on. So, APRIORI can be thought of as an algorithm that builds the pattern trie in a breadth first way. We propose an algorithm that builds the trie in a depth first way. We will explain the depth first construction of the trie using the dataset of Figure 1. Note that the trie grows from right to left.

Figure 2. An example of a trie (without support counts). The root node holds the buckets A, B, C, E, F; the node below A holds B, E, F, the node below B holds C, E, F, and the nodes below C and E each hold F; one level deeper, the node below AB holds E, F, and the nodes below AE and BE each hold F; finally the node below ABE holds F.

The algorithm proceeds as follows. In a preprocessing step, the support of each single item is counted and the infrequent items are eliminated. Let the frequent items be denoted by i_1, i_2, ..., i_m. Next, the code from Figure 3 is executed.

  (1)  T := the trie including only bucket i_m;
  (2)  for k := m-1 downto 1 do
  (3)    T' := T;
  (4)    T := T with i_k added to the left and a copy of T' appended to i_k;
  (5)    S := the subtrie rooted in i_k;
  (6)    count(S);
  (7)    delete the infrequent itemsets from S;

  (9)  procedure count(S)::
  (10)   for every transaction t including item i_k do
  (11)     for every itemset I in S do
  (12)       if t supports I then increase I.support;

Figure 3. The algorithm.

The procedure count determines the support of each itemset (bucket) in the subtrie S. This is achieved by a database pass, in which each transaction including item i_k is considered. Any such transaction is one at a time "pushed" through S, where it only traverses a subtrie if it includes the root of this subtrie, meanwhile updating the support fields in the buckets. In the last paragraph from Section 4 a refinement of this part of the algorithm is presented. On termination of the algorithm, T contains exactly the frequent itemsets.

Figure 4 illustrates the consecutive steps of the algorithm applied to our example. The single items surpassing the minimum support threshold 3 are A, B, C, E and F. In the figure, the shape of T after each iteration of the for loop is shown. Also the infrequent itemsets to be deleted at the end of an iteration are mentioned. At the start of the iteration with index k, the root of trie T consists of the 1-itemsets i_{k+1}, ..., i_m. (We denote a 1-itemset by the name of its only item, omitting curly braces and commas, as in Figure 1 and Figure 4.) By the statement in line (3) from Figure 3, this trie may also be referred to as T'. A new trie is composed by adding bucket i_k to the root and by appending a copy of T' (the former value of T) to i_k. The newly added buckets are the new candidates and they make up a subtrie S. In Figure 4, the candidate set is in the left part of each trie and is drawn in bold. Notice that the final trie (after deleting infrequent itemsets) is identical to Figure 2.

Figure 4. Illustrating the algorithm: the shape of the trie after each iteration. In the iteration that adds C, the candidates CE and CEF are infrequent and hence deleted; in the iteration that adds B, BCF is infrequent and hence deleted; in the final iteration, which adds A, the candidates ABC, AC and ACF are infrequent and hence deleted.

The number of iterations in the for loop is one less than the number of frequent 1-itemsets. Consequently, the number of database passes is one less than the number of frequent 1-itemsets. This causes the algorithm to be tractable only if the database under consideration is memory-resident. Given the present-day memory sizes, this is not a real constraint any more.

As stated above, our algorithm has a preprocessing step which counts the support for each single item. After this preprocessing step, the items may be re-ordered. The most favorable execution time is achieved if we order the items by increasing frequency (see Section 3 for a more formal motivation). It is better to have low support at the top of the deeper side (to the left bottom) of the trie and hence high support at the top of the shallow part (to the upper right) of the trie.

We may distinguish between "dense" datasets and "sparse" datasets. A dense dataset has many frequent patterns of large size and high support, as is the case for test sets such as chess and mushroom (see Section 5). In those datasets, many transactions are similar to each other. Datasets with mainly short patterns are called sparse. Longer patterns may exist, but with relatively small support. Real-world transaction databases of supermarkets mostly belong to this category. Also the synthetic datasets from Section 5 have similar properties: interesting support thresholds are much lower than in the dense case.

Algorithms for finding frequent patterns may be divided into two types: algorithms with and algorithms without candidate generation. Any APRIORI-like instance belongs to the first type. Eclat (see [9]) may also be considered as an instance of this type. The FP-growth algorithm from [5] is the best-known instance of the second type (though one can also defend the point of view that it does generate candidates). For dense datasets, FP-growth performs better than candidate generating algorithms. FP-growth stores the dataset in a way that is very efficient especially when the dataset has many similar transactions. In the case of algorithms that do apply candidate generation, dense sets produce a large number of candidates. Since each new candidate has to be related to each transaction, the database passes take a lot of time. However, for sparse datasets, candidate generation is a very suitable method for finding frequent patterns. In our experience, the instances of the APRIORI family are very useful when searching transaction databases. According to the results in [7] the depth first algorithm DF outperforms FP-growth on the synthetic transaction sets (see Section 5 for a description of these sets).

Finally, note that technically speaking DF is not a full implementation of APRIORI, since every candidate itemset is known to have only one frequent subset (resulting from the part of the trie which has already been completed) instead of two. Apart from this, its underlying candidate generation mechanism strongly resembles the one from APRIORI.
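The construction of Figure 3 can be sketched in a few lines of C++. This is a simplified illustration of the depth first idea only (itemsets stored as sorted strings, support counted by a full database scan rather than by pushing transactions through a trie); it is not the authors' dffast.cc, and all names are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// An itemset is a sorted string over the item alphabet, e.g. "ABF".
using Itemset = std::string;

// support(s): the number of transactions containing every item of s.
// Both s and each transaction must be sorted for std::includes.
static int support(const Itemset& s, const std::vector<std::string>& db) {
    int count = 0;
    for (const auto& t : db)
        if (std::includes(t.begin(), t.end(), s.begin(), s.end())) ++count;
    return count;
}

// Depth first construction: items are processed from i_m down to i_1,
// mirroring lines (2)-(7) of Figure 3. In each step the candidates are
// {i_k} plus i_k prepended to every frequent itemset found so far (the
// copy of T'); infrequent candidates are pruned immediately.
std::vector<Itemset> depth_first_apriori(const std::vector<std::string>& db,
                                         const std::string& items, int minsup) {
    std::vector<Itemset> trie;  // all frequent itemsets found so far
    for (int k = static_cast<int>(items.size()) - 1; k >= 0; --k) {
        std::vector<Itemset> candidates;
        candidates.push_back(std::string(1, items[k]));          // bucket {i_k}
        for (const auto& s : trie)                               // copy of T'
            candidates.push_back(std::string(1, items[k]) + s);  // appended to i_k
        for (const auto& c : candidates)                         // count + prune
            if (support(c, db) >= minsup) trie.push_back(c);
    }
    std::sort(trie.begin(), trie.end());
    return trie;
}
```

Running this on the dataset of Figure 1 (with the infrequent item D removed and any order of the frequent items, e.g. "ABCEF") with minsup 3 yields exactly the 18 frequent itemsets listed there. The paper sorts the items by increasing frequency; that ordering affects only the amount of work, not the result.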
3 Theoretical Complexity

Let n denote the number of transactions (also called customers), and let m denote the number of products (also called items). Usually n is much larger than m. For a non-empty itemset I we define:

  support(I) is the support of I: the number of customers that buy all products from I (and possibly more), or equivalently the number of transactions that contain I;

  min(I) is the smallest number in I;

  max(I) is the largest number in I.

In line with this we let support(∅) = n. We also put max(∅) = 0 and min(∅) = m + 1. A set I is called frequent if support(I) ≥ minsup, where the so-called support threshold minsup is a fixed number given in advance. We assume every 1-itemset to be frequent; this can be effected by the first step of the algorithms we are looking at, which might be considered as preprocessing.

A "database query" is defined as a question of the form "Does customer i buy product j?" (or "Does transaction i have item j?"), posed to the original database. Note that we have nm database queries in the "preprocessing" phase in which the supports of the 1-itemsets are computed and ordered: every field of the database is inspected once. (By the way, the sorting, in which the items are assigned the numbers 1, 2, ..., m, takes O(m log m) time.) The number of database queries for DF equals:

(1)

For a proof, see [3]. It relies on the fact that in order for a node to occur in the trie the path to it (except for the root) should be frequent, and on the observation that this particular node is "questioned" every time a transaction follows this same path. In [3] a simple version of FP is described in a similar style, leading to

(2)

database queries in "local databases" (FP-trees), except for the preprocessing phase. Note the extra condition on the inner summation (which is "good" for FP: we have fewer summands there), while on the other hand the summands are larger (which is "good" for DF: we have a smaller contribution there).

It also makes sense to look at the total number of nodes of the trie during its construction, which is connected to the effort of maintaining and using the data structures. Counting each trie node with the number of buckets it contains, the total is computed to be:

(3)

When the trie is finally ready the number of remaining buckets equals the number of frequent sets, each item in a node being the end of the path that represents the corresponding itemset.

Notice that the complexity heavily depends on the sorting order of the items at the top level. It turns out that an increasing order of items is beneficial here. This is suggested by the contribution of the 1-itemsets in Equation (1):

(4)

which happens to be minimal in that case. This 1-itemset contribution turns out to be the same for both DF and FP: see [3, 7], where these results are presented in more detail.

4 Implementation Issues

In this section we discuss some implementation details of our program. As mentioned in Section 2, the database is traversed many times. It is therefore necessary that the database is memory-resident. Fortunately, only the occurrences of frequent items need to be stored. The database is represented by a two-dimensional boolean array. For efficiency reasons, one array entry corresponds to one bit. Since the function count in the algorithm considers the database transaction by transaction, a horizontal layout is chosen, cf. [4].

We have four preprocessing steps before the algorithm of Section 2 actually starts.

1 The range of the item values is determined. This is necessary, because some test sets, e.g., the BMS-WebView sets, have only large item values.

2 This is an essential initial step. First, for each item the support is counted. Next, the frequent items are selected and sorted by frequency. This process is relevant, since the frequency order also prescribes the order in the root of the trie, as stated before. The sorted frequent items along with their supports are retained in an array.

3 If a transaction has zero or one frequent item, it need not be stored into the memory-resident representation of the database. The root of the trie is constructed according to the information gathered in step 2. For constructing the other buckets, only transactions with at least two frequent items are relevant. In this step, we count the relevant transactions.

4 During this step the database is stored into a two-dimensional array with horizontal layout. Each item is given a new number, according to its rank in the frequency order. The length of the array equals the result of step 3; the width is determined by the number of frequent items.

After this preparatory work, which in practice usually takes a few seconds, the code as described in Section 2 is executed. The cells of the root are constructed using the result of initial step 2.

In line (12) from Figure 3 in Section 2, backtracking is applied to inspect each path of S. Inspecting a path is aborted as soon as an item outside the current transaction is found. Obviously, processing one transaction during the count procedure is a relatively expensive task, which is unfortunately inevitable, whichever version of APRIORI is used.

Figure 5. Experimental results for database chess: execution time of DF in seconds (left axis) and number of frequent sets in millions (right axis), plotted against the relative support (%).
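The memory-resident bit layout from the preprocessing steps above can be sketched as follows. The class and member names are our own invention (the paper shows no code); the support routine mirrors the count procedure's early abort as soon as an item is missing from the current transaction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Horizontal layout as in Section 4: one row per transaction, one bit per
// frequent item (items re-numbered 0..numItems-1 by frequency rank).
class BitDatabase {
public:
    explicit BitDatabase(std::size_t numItems) : width_((numItems + 63) / 64) {}

    // Append a transaction given as a list of re-numbered frequent item ids.
    void addTransaction(const std::vector<int>& items) {
        std::vector<std::uint64_t> row(width_, 0);
        for (int j : items) row[j / 64] |= std::uint64_t(1) << (j % 64);
        rows_.push_back(row);
    }

    // One "database query": does transaction t contain item j?
    bool has(std::size_t t, int j) const {
        return (rows_[t][j / 64] >> (j % 64)) & 1;
    }

    // Support of an itemset, aborting a transaction as soon as one item is
    // missing (the cut-off used when inspecting a path in line (12)).
    int support(const std::vector<int>& itemset) const {
        int s = 0;
        for (std::size_t t = 0; t < rows_.size(); ++t) {
            bool ok = true;
            for (int j : itemset)
                if (!has(t, j)) { ok = false; break; }  // abort this path
            if (ok) ++s;
        }
        return s;
    }

private:
    std::size_t width_;                             // 64-bit words per row
    std::vector<std::vector<std::uint64_t>> rows_;  // one row per transaction
};
```

With the Figure 1 dataset and the frequent items numbered A=0, B=1, C=2, E=3, F=4 (the infrequent item D is dropped), support({A,B}) comes out as 4 and support({C,E}) as 2, matching the tables in Section 2.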
As mentioned in the introduction, we used to have two
implementations, one being time efficient, the other being
Database mushroom
memory efficient. These two have been used in the overall 200
execution time DF
FIMI’03 comparisons. The newest implementation (called number of frequent sets (scale on right axis)
5
dffast.cc) combines these versions by using the follow-
ing refinement. Instead of appending a copy of to 150
4
number of sets in 1,000,000s
(see Figure 3 in Section 2), first the counting is done in aux-
runtime (seconds)
iliary fields in the original , after which only the frequent
buckets are copied underneath . This makes the dele-
3
100
tion of infrequent itemsets (line (7) from Figure 3) unnec- 2
essary and leads to better memory management. Another 50
improvement might be achieved by using more auxiliary 1
fields while adding two root items simultaneously in each
iteration, thereby halving the number of database passes at 0
14 12 10 8 6 4
0
the cost of more bookkeeping. relative support (%)
Figure 6. Experimental results for database
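The benefit of the increasing-frequency order can be illustrated with a small rearrangement argument. The shape of the sum below is our own simplification (each root bucket i_k has m - k candidate extensions at depth 1, each tested once per transaction containing i_k); the exact 1-itemset contribution is the one in Equation (4) of Section 3, for which we refer to [3].

```latex
\sum_{k=1}^{m-1} (m-k)\,\mathrm{support}(\{i_k\})
```

Since the multipliers m-1 > m-2 > ... > 1 are fixed and decreasing in k, the rearrangement inequality says a sum of this shape is minimized when the supports increase with k, i.e. when support({i_1}) ≤ support({i_2}) ≤ ... ≤ support({i_m}): precisely the increasing-frequency order recommended in Section 2.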
5 Experiments .
Using the relatively small database chess (342 kB, with 3,196 transactions; available from the FIMI'03 website at http://fimi.cs.helsinki.fi/testdata.html), the database mushroom (570 kB, with 8,124 transactions; also available from the FIMI'03 website) and the well-known IBM-Almaden synthetic databases (see [2]) we shall examine the complexity of the algorithm. These databases have either few, but coherent records (chess and mushroom), or many records (the synthetic databases). The parameters for generating a synthetic database are the number of transactions (in thousands), the average transaction size and the average length of so-called maximal potentially large itemsets. The number of items was set following the design in [2]. We use T10I4D100K (4.0 MB) and T40I10D100K (15.5 MB), both also available from the FIMI'03 website mentioned above; they both contain 100,000 transactions.

The experiments were conducted on a Pentium-IV machine with 512 MB memory at 2.8 GHz, running Red Hat Linux 7.3. The program was developed under the GNU C compiler, version 2.96.

The following statistics are plotted in the graphs: the execution time in seconds of the algorithm (see Section 4), and the total number of frequent itemsets; in all figures the corresponding axis is on the right hand side and scales 0–5,500,000 (0–8,000,000 for T10I4D100K). The execution time excludes preprocessing: in this phase the database is read three times in order to detect the frequent items (see before); also excluded is the time needed to print the resulting itemsets. These actions together usually only take a few seconds. The number of frequent 1-itemsets (m from the previous sections, where we assumed all 1-itemsets to be frequent) has range 31–39 for the experiments on the database chess, 54–76 for mushroom, 844–869 for T10I4D100K and 610–862 for T40I10D100K. Note the very high support thresholds for mushroom (at least 5%) and chess (at least 44%); for T10I4D100K a support threshold as low as 0.003% was even feasible.

Figure 7. Experimental results for database T10I4D100K: execution time of DF in seconds (left axis) and number of frequent sets in millions (right axis), plotted against the relative support (%).

The largest output files produced are of size 110.6 MB (chess, minsup = 1,400, having 3,771,728 frequent sets with 13 frequent 17-itemsets), 121.5 MB (mushroom, minsup = 400, having 3,457,747 frequent sets with 24 frequent 17-itemsets), 131.1 MB (T10I4D100K, minsup = 3, having 6,169,854 frequent sets with 30 frequent 13-itemsets and 1 frequent 14-itemset) and 195.9 MB (T40I10D100K, minsup = 300, having 5,058,313 frequent sets, with 21 frequent 19-itemsets and 1 frequent 20-itemset). The final trie in the T40I10D100K case occupies approximately 65 MB of memory; the output file in this case is 3 times as large.

Figure 8. Experimental results for database T40I10D100K: execution time of DF in seconds (left axis) and number of frequent sets in millions (right axis), plotted against the relative support (%).

Note that the 3,457,747 sets for the chess database with minsup = 1,400 require 829 seconds to find, whereas the 3,771,728 frequent itemsets for mushroom with minsup = 400 take 158 seconds, differing approximately a factor 5 in time. This difference in runtime is probably caused by the difference in the absolute minsup value. Each cell corresponding to a frequent itemset is visited at least 1,400 times in the former case against 400 times in the latter case. A similar phenomenon is observed when comparing T40I10D100K with absolute minsup value 300 and T10I4D100K with minsup = 3: this takes 378 versus 88 seconds. Although the outputs have the same orders of magnitude, the runtimes differ substantially. We see that, besides the number of frequent itemsets and the sizes of these sets, the absolute minsup value is a major factor determining the runtime.

6 Conclusions

In this paper, we addressed DF, a depth first implementation of APRIORI. In our experience, DF competes with any other well-known algorithm, especially when applied to large databases with many transactions.

Storing the database in primary memory is no longer a problem. On the other hand, storing the candidates causes trouble in situations where a dense database is considered with a small support threshold. This is the case for any algorithm using candidates. Therefore, it would be desirable to look for a method which stores candidates in secondary memory. This is an obvious topic for future research. To our knowledge, FP-growth is the only algorithm that can cope with such memory limitations. However, for real world retail databases this algorithm is surpassed by DF, as we showed in [7]. Other optimizations might also be possible. Besides improving the C code, ideas from, e.g., [10] on diffsets with vertical layouts might be used.

Our conclusion is that DF is a simple, practical, straightforward and fast algorithm for finding all frequent itemsets.

References

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J.B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings 20th International Conference on Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.
[3] J.M. de Graaf, W.A. Kosters, W. Pijls, and V. Popova. A theoretical and practical comparison of depth first and FP-growth implementations of Apriori. In H. Blockeel and M. Denecker, editors, Proceedings of the Fourteenth Belgium-Netherlands Artificial Intelligence Conference (BNAIC 2002), pages 115–122, 2002.

[4] B. Goethals. Survey on frequent pattern mining. Helsinki, 2003.

[5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), pages 1–12, 2000.

[6] J. Hipp, U. Günther, and G. Nakhaeizadeh. Mining association rules: Deriving a superior algorithm by analyzing today's approaches. In D.A. Zighed, J. Komorowski, and J. Żytkov, editors, Principles of Data Mining and Knowledge Discovery, Proceedings of the 4th European Conference (PKDD 2000), Springer Lecture Notes in Computer Science 1910, pages 159–168. Springer Verlag, 2000.

[7] W.A. Kosters, W. Pijls, and V. Popova. Complexity analysis of depth first and FP-growth implementations of Apriori. In P. Perner and A. Rosenfeld, editors, Machine Learning and Data Mining in Pattern Recognition, Proceedings MLDM 2003, Springer Lecture Notes in Artificial Intelligence 2734, pages 284–292. Springer Verlag, 2003.

[8] W. Pijls and J.C. Bioch. Mining frequent itemsets in memory-resident databases. In E. Postma and M. Gyssens, editors, Proceedings of the Eleventh Belgium-Netherlands Conference on Artificial Intelligence (BNAIC 1999), pages 75–82, 1999.

[9] M.J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12:372–390, 2000.

[10] M.J. Zaki and K. Gouda. Fast vertical mining using diffsets. In Proceedings 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), pages 401–406, 2001.