=Paper=
{{Paper
|id=Vol-126/paper-1
|storemode=property
|title=Surprising Results of Trie-based FIM Algorithms
|pdfUrl=https://ceur-ws.org/Vol-126/bodon.pdf
|volume=Vol-126
|dblpUrl=https://dblp.org/rec/conf/fimi/Bodon04
}}
==Surprising Results of Trie-based FIM Algorithms==
Surprising results of trie-based FIM algorithms
Ferenc Bodon∗
bodon@cs.bme.hu
Department of Computer Science and Information Theory,
Budapest University of Technology and Economics
∗ Research is partially supported by the Hungarian National Science Foundation (Grant No. OTKA TS-044733, T42706, T42481).

Abstract

Trie is a popular data structure in frequent itemset mining (FIM) algorithms. It is memory-efficient, and it allows fast construction and information retrieval. Many trie-related techniques can be applied in FIM algorithms to improve efficiency. In this paper we propose new techniques for fast management, but more importantly we scrutinize the well-known ones, especially those that can be employed in APRIORI. The theoretical claims are supported by the results of a comprehensive set of experiments, based on hundreds of tests that were performed on numerous databases with different support thresholds. We offer some surprising conclusions, which at some points contradict published claims.

1. Introduction

Frequent itemset mining is the most researched field of frequent pattern mining. Techniques and algorithms developed here are often used in the search for other types of patterns (such as sequences, rooted trees, boolean formulas, graphs). The original problem was to discover association rules [2], where the main step was to find frequently occurring itemsets. Over one hundred FIM algorithms have been proposed, the majority claiming to be the most efficient. The truth is that no single most efficient algorithm exists; there is no published method that outperforms every other method on every dataset with every support threshold [11]. However, there are three algorithms that play a central role due to their efficiency and the fact that many algorithms are modifications or combinations of these basic methods. These algorithms are APRIORI [3], Eclat [25] and FP-growth [12].

Those who employed one of the basic algorithms as a search strategy tended to employ the whole set of procedures and data structures as well, which is the trie (prefix tree) in the case of APRIORI and FP-growth. Therefore it is useful and instructive to analyze tries and to clarify those details that have an effect on run-time or memory need. In this paper we will see that small details can have a large influence on efficiency, and that taking a closer look at them brings up new questions and new solutions.

The rest of the paper is organized as follows. The problem is presented in Section 2; the trie and its role in FIM is described in Section 3. Section 4 introduces accelerating techniques one-by-one. Surprising experimental results and their explanations are given in Section 5.

2. Problem statement

Frequent itemset mining is a special case of frequent pattern mining. Let us first describe this general case. We assume that the reader is familiar with the basics of poset theory. We call a poset (P, ⪯) locally finite if every interval [x, y] is finite, i.e. the number of elements z such that x ⪯ z ⪯ y is finite. The element x covers y if y ≺ x and there is no z such that y ≺ z ≺ x.

Definition 1. We call the poset PC = (P, ⪯) a pattern context if there exists exactly one minimal element and PC is locally finite and graded, i.e. there exists a size function |·| : P → Z such that |p| = |p′| + 1 if p covers p′. The elements of P are called patterns and P is called the pattern space¹ or pattern set.

¹ Many researchers improperly refer to the pattern space as the pattern lattice. This may come from the fact that patterns were first examined as sets of items [2], when (P, ⪯) actually formed a lattice. However, this property does not hold for many other types of patterns. It is easy to prove that if the type of pattern is sequence, boolean formula or graph, then the least upper bound is not unique.

Without loss of generality we assume that the size of the minimal pattern is 0, and it is called the empty pattern.

In the frequent pattern mining problem we are given the set of input data T, the pattern context PC = (P, ⪯), the anti-monotonic function supp_T : P → N and min_supp ∈ N.
We have to find the set F = {p ∈ P : supp_T(p) ≥ min_supp} and the support of the patterns in F. Elements of F are called frequent patterns, supp_T is the support function and min_supp is referred to as the support threshold.

There are many types of patterns: itemsets [2], item sequences, sequences of itemsets [4], episodes [16], boolean formulas [15], rooted labeled ordered/unordered trees [24], labeled induced subgraphs [13], labeled subgraphs [14]. In frequent itemset mining [2] the pattern context is (2^I, ⊆), where I is a given set, ⊆ is the usual subset relation, and the input data is a sequence of transactions (T = t_1, ..., t_m). The elements of I are called items and each transaction is a set of items. The support of itemset I is the number of transactions that contain I as a subset.
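To make this definition concrete for itemsets, the following small sketch computes the support of a single candidate itemset over a list of transactions. The function name and the use of std::set are illustrative choices of ours, not part of the paper's implementation.

#include <algorithm>
#include <set>
#include <vector>

using Item = int;
using Itemset = std::set<Item>;

// Support of candidate I: the number of transactions t with I ⊆ t.
// Both the candidate and the transactions are kept as ordered sets,
// so std::includes can test the subset relation in linear time.
int support(const Itemset& candidate, const std::vector<Itemset>& transactions) {
    int count = 0;
    for (const Itemset& t : transactions) {
        if (std::includes(t.begin(), t.end(), candidate.begin(), candidate.end()))
            ++count;
    }
    return count;
}
// An itemset is frequent when support(candidate, transactions) >= min_supp.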
There exist many algorithms which efficiently solve the FIM problem. Most of them are APRIORI or FP-growth based, and their efficiency comes from the sophisticated use of the trie data structure. The goal of our work is to scrutinize the use of tries theoretically and experimentally. Our claims are supported by hundreds of experiments based on many different databases with various support thresholds. We believe that a result of our work is to clarify important technical details of APRIORI, and to take some steps towards finding the best implementation.

3. Tries

The data structure trie was originally introduced by de la Briandais [9] and Fredkin [10] to store and efficiently retrieve words of a dictionary. A trie is a rooted, labeled tree. The root is defined to be at depth 0, and a node at depth d can point to nodes at depth d + 1. A pointer is also referred to as an edge or link. If node u points to node v, then we call u the parent of v, and v a child node of u. For the sake of efficiency – concerning insertion and deletion – a total order on the labels of the edges has to be defined.

Tries are suitable for storing and retrieving not only words, but any finite sets (or sequences). In FIM algorithms tries (also called lexicographic trees) are used to quickly determine the support of itemsets whose size is greater than 2. In the FIM setting a link is labeled by a frequent item, and a node represents an itemset, namely the set of the items on the path from the root to that node. The label of a node stores the counter of the itemset that the node represents.

Figure 1 presents a trie (without the counters) that stores the itemsets {A}, {C}, {E}, {F}, {A,C}, {A,E}, {A,F}, {E,F}, {A,E,F}. Building a trie is straightforward; we omit the details.

Figure 1. Example: a trie

Tries can be implemented in many ways. In the compact representation the edges of a node are stored in a vector. Each element of the vector is a pair; the first element stores the label of the edge, the second stores the address of the node the edge points to. This solution is very similar to the widespread "doubly chained" representation, where the edges of a node are stored in a linked list.

In the non-compact representation (also called tabular implementation by Fredkin) only the pointers are stored, in a vector whose length equals the size of the alphabet (the frequent items in our case). An element at index i belongs to the edge whose label is the i-th item. If there is no edge with such a label, then the element is NIL. This solution has the advantage of finding an edge with a given label in O(1) time, instead of the O(log n) required by a binary search, which is the case in the compact representation. Unfortunately, for nodes with few edges this representation requires more memory than the compact representation. On the contrary, if a node has many edges (an exact formula can be given based on the memory need of pointers and labels), then the non-compact representation needs less memory, since labels are not stored explicitly. According to the memory need, the two approaches can be combined [21], [23], [6]. If there are many nodes with single edges, then further memory can be saved by using patricia tries. Throughout this paper and in the implementations the compact representation of tries is used.
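As an illustration, a compact-representation trie node can be sketched in C++ along the following lines; the struct and field names are ours and only approximate the representation described above.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Item = std::uint32_t;

// One node of a trie in compact representation: a support counter and a
// vector of (edge label, child) pairs kept sorted by label, so that a
// single edge can be located by binary search in O(log n) time.
struct TrieNode {
    std::uint32_t counter = 0;
    std::vector<std::pair<Item, TrieNode*>> edges;  // sorted by the label

    // Returns the child reached by the edge labeled 'item', or nullptr.
    TrieNode* child(Item item) const {
        auto it = std::lower_bound(
            edges.begin(), edges.end(), item,
            [](const std::pair<Item, TrieNode*>& e, Item i) { return e.first < i; });
        return (it != edges.end() && it->first == item) ? it->second : nullptr;
    }
};

In the non-compact (tabular) representation the edges vector would instead be replaced by a vector of child pointers indexed directly by the item code.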
Tries are used in FIM algorithms in two ways. In APRIORI based algorithms the tries store the candidates (itemsets whose support has to be determined), while in APRIORI and FP-growth based algorithms the input sequence (more precisely, a projection of the input) is stored in a trie.

4. Techniques for fast management

In this section we take the trie issues one-by-one, describe the problem and present previous claims or naive expectations. Some issues apply to APRIORI and FP-growth based algorithms, but some apply only to the first algorithm family.
4.1. Storing the transactions

Let us call the itemset that is obtained by removing the infrequent items from a transaction t the filtered transaction of t. All frequent itemsets can be determined even if only the filtered transactions are available. To reduce IO cost and speed up the algorithm, the filtered transactions can be stored in main memory instead of on disk. It is useless to store the same filtered transaction multiple times. Instead, store it once and employ a counter that stores its multiplicity. This way memory is saved and run-time can be significantly improved.

This fact is used in FP-growth and can be used in APRIORI as well. In FP-growth the filtered transactions are stored in an FP-tree, which is a trie with cross-links and a header table. The size of the FP-tree that stores the filtered transactions is declared to be "substantially smaller than the size of the database". This is said to come from the fact that a trie stores the same prefixes only once.

In the case of APRIORI, collecting filtered transactions has a significant influence on run-time. This is due to the fact that finding the candidates that occur in a given transaction is a slow operation, and the number of these procedure calls is considerably reduced. If a filtered transaction occurs n times, then the expensive procedure will be called just once (with counter increment n) instead of n times (with counter increment 1). A trie can be used to store the filtered transactions, and it is actually used in today's fastest APRIORI implementation made by Christian Borgelt [6].

Is the trie really the best solution for collecting filtered transactions for APRIORI, or does a better solution exist? See the experimental results and explanation for the surprising answer.
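A minimal way to collect filtered transactions together with their multiplicities is an ordered associative container keyed by the filtered transaction; the sketch below uses std::map (a red-black tree in common STL implementations), which is one of the alternatives compared in Section 5.1. The names are illustrative only.

#include <algorithm>
#include <map>
#include <set>
#include <vector>

using Item = int;
using FilteredTransaction = std::vector<Item>;  // frequent items only, sorted
// Maps each distinct filtered transaction to its multiplicity.
using TransactionCache = std::map<FilteredTransaction, unsigned>;

// Removes infrequent items from a raw transaction and counts the result once;
// identical filtered transactions share a single entry with an incremented counter.
void addTransaction(const std::vector<Item>& raw,
                    const std::set<Item>& frequentItems,
                    TransactionCache& cache) {
    FilteredTransaction filtered;
    for (Item i : raw)
        if (frequentItems.count(i)) filtered.push_back(i);
    std::sort(filtered.begin(), filtered.end());
    if (!filtered.empty()) ++cache[filtered];  // default-constructs 0, then increments
}

During support counting, each entry is then processed once and its counter is added to the matched candidates.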
4.2. Order of items

For quick insertion, and to quickly decide whether an itemset is stored in a trie, it is useful to store the edges ordered according to their labels (i.e. items in our case) and apply a binary search. In [20] it was first noted that the order of the items affects the shape of the trie. Figure 2 shows an example of two tries that store the same itemsets (ABC, ABD, ACE) but use different orders (A > B > C > D > E and its reverse).

Figure 2. Example: Tries with different orders

For the sake of reducing the memory need and the traversal time, it would be useful to use the ordering that results in the minimal trie. Comer and Sethi proved in [8] that the minimal trie problem is NP-complete. On the other hand, a simple heuristic (which was employed in FP-growth) performs very well in practice: use the descending order according to the frequencies. This is inspired by the fact that tries store the same prefixes only once, and there is a higher chance of itemsets having the same prefixes if frequent items have small indices.

The heuristic does not always result in the minimal trie, which is proven by the following example. Let us store the itemsets AX, AY, BXK, BXL, BM, BN in a trie. A trie that uses descending order according to the frequencies (B ≺ X ≺ A ≺ K ≺ L ≺ M ≺ N) is depicted on the left side of Figure 3. On the right side we can see a trie that uses another order (items X and A are swapped) and has fewer nodes.

Figure 3. Example: descending order does not result in the smallest trie

Descending order may not result in the smallest trie when the trie has subtrees in which the order of the items according to the frequency within the subtree does not correspond to the order of the items according to the frequency in the whole trie. This seldom occurs in real-life databases (a kind of homogeneity exists), which is the reason for the success of the heuristic.

Can this heuristic be applied in APRIORI as well? Fewer nodes require less memory, and fewer nodes need to be visited during the support count (fewer recursive steps), which would suggest faster operation. However, previous observations [1] [6] claim the opposite. Here we conduct comprehensive experiments and try to find reasons for the contradiction.
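Applying a frequency-based order in practice amounts to renumbering the frequent items once, before any trie is built. The sketch below is our own illustration (not code from the paper) of such a recoding; passing a different comparator yields the ascending order that, surprisingly, turns out to be preferable for APRIORI in Section 5.2.

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using Item = int;

// Returns a recoding table: recode[original item] = new code, where new codes
// 0, 1, 2, ... are assigned in descending order of support. Only the frequent
// items are expected as keys of 'supportOfItem'.
std::map<Item, Item> frequencyRecoding(const std::map<Item, unsigned>& supportOfItem) {
    std::vector<std::pair<Item, unsigned>> items(supportOfItem.begin(), supportOfItem.end());
    std::sort(items.begin(), items.end(),
              [](const std::pair<Item, unsigned>& a, const std::pair<Item, unsigned>& b) {
                  return a.second > b.second;   // descending by support
              });
    std::map<Item, Item> recode;
    Item next = 0;
    for (const auto& entry : items) recode[entry.first] = next++;
    return recode;
}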
4.3. Routing strategies at the nodes

The routing strategy in a trie refers to the method used at an inner node to select the edge to follow. To find out whether a single itemset I is contained in a trie, the best routing strategy is to perform a binary search: at depth d − 1 we have to find the edge whose label is the same as the d-th item of I. The best routing strategy is not so straightforward if we have to find all the k-itemset subsets of a given itemset that are contained in a given trie. This is the main step of the support count in APRIORI, and this is the step that primarily determines the run-time of the algorithm. In this section we examine the possible solutions and propose a novel solution that is expected to be more efficient.

In the support count methods we have to find all the leaves that represent k-itemset candidates contained in a given transaction t. Let us assume that we arrive at a node at depth d by following the j-th item of the transaction. We move forward on links that have labels i ∈ t with an index greater than j but less than |t| − k + d + 1, because we need at least k − d − 1 further items to reach a leaf that represents a candidate. Or simply, given a part of the transaction (t′), we have to find the edges that correspond to an item in t′. The items in the transaction and the items of the edges are ordered. The number of edges of the node is denoted by n. Two elementary solutions may be applicable:

simultaneous traversal: Two pointers are maintained; one is initialized to the first element of t′, and the other is initialized to the item of the first edge. The pointer that points to the smaller item is increased. If the pointed items are the same, then a match is found, and both pointers are increased. The worst case run-time is O(n + |t′|).

binary search: This includes two basic approaches and the combination of both: for each item in t′ we find the corresponding edge (if there is any), or for each edge the corresponding item of t′. The run-times of the two approaches are O(|t′| log n) and O(n log |t′|), respectively. Since the lists are ordered, it is not necessary to perform a binary search on the whole list once a match is found. For example, if the first item in t′ corresponds to the label of the fifth edge, then for the second element of t′ we only have to check labels starting from the sixth edge.

From the run-times we can see that if the size of t′ is small and there are many edges, then the first kind of binary search is faster, and in the opposite case the second kind is better. We can combine the two solutions: if |t′| log n < n log |t′|, then we perform a binary search on the edges – otherwise the binary search is applied on t′.

bitvector based: As mentioned before, a binary search can be avoided if the non-compact representation is used. However, this increases the memory need. A much better solution is to change the representation of the filtered transaction rather than the nodes of the trie. We can use a bitvector instead of an ordered list. The element at index i is 1 if item i is stored in the transaction, and the length of the bitvector is the number of frequent items. A bitvector needs more space if the size of the filtered transaction is small, which is the general case in most applications. Hence, it is useful to store the filtered transactions as lists and convert them into bitvectors when the stored candidates have to be determined. The run-time of this solution is O(n).

indexvector based: The problem with bitvectors is that they do not exploit the fact that at a certain depth only a part of the transaction needs to be examined. For example, if the item of the first edge is the same as the last item of the basket, then the other edges should not be examined. The bitvector-based approach does not take into consideration the positions of the items in the basket. We can easily overcome this problem if the indices of the items are stored in the vector. For example, transaction {B, D, G} is stored as [0, 1, 0, 2, 0, 0, 3, 0, 0, 0] if the number of frequent items is 10. The routing strategy with this vector is the following. We take the items of the edges one-by-one. If item i is the actual one, we check element i of the vector. If it is 0, the item is not contained. If the element is smaller than |t| − k + d + 1, then a match is found (and the support count proceeds with the next edge). Otherwise the procedure is terminated. The worst case run-time of this solution is O(n).

We have presented six different routing strategies that do not change the structure of the trie. Theoretically no best strategy can be declared. However, our experiments have shown that the solution we proposed last always outperforms the others.
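To make two of these strategies concrete, the sketch below shows a simultaneous (merge-like) traversal of the sorted edge list against the sorted part of the transaction, and the lookup used by the indexvector-based strategy. It is our own illustration under the assumptions stated in the comments, not the paper's actual code.

#include <cstddef>
#include <utility>
#include <vector>

using Item = int;
struct TrieNode;                                  // as sketched in Section 3
using Edge = std::pair<Item, TrieNode*>;          // (label, child), sorted by label

// Simultaneous traversal: both 'edges' and the remaining transaction part 'tPart'
// are sorted, so matching edges are found by advancing two pointers,
// in O(n + |t'|) time. Matches are appended to 'out'.
void simultaneousTraversal(const std::vector<Edge>& edges,
                           const std::vector<Item>& tPart,
                           std::vector<const Edge*>& out) {
    std::size_t e = 0, t = 0;
    while (e < edges.size() && t < tPart.size()) {
        if      (edges[e].first < tPart[t]) ++e;
        else if (tPart[t] < edges[e].first) ++t;
        else { out.push_back(&edges[e]); ++e; ++t; }   // equal labels: a match
    }
}

enum class Routing { Absent, Match, Stop };

// Indexvector lookup: 'pos' stores, for every frequent item, its (1-based)
// position in the transaction, or 0 if the item is absent (e.g. {B, D, G}
// becomes [0,1,0,2,0,0,3,0,0,0] for ten frequent items). 'limit' stands for
// the bound |t| - k + d + 1 from the text: a position at or beyond it can no
// longer lead to a candidate, so the scan over the edges may stop.
Routing indexVectorLookup(const std::vector<int>& pos, Item edgeLabel, int limit) {
    int p = pos[edgeLabel];
    if (p == 0)    return Routing::Absent;  // edge label not in the transaction
    if (p < limit) return Routing::Match;   // follow this edge
    return Routing::Stop;                   // terminate at this node
}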
4.4. Storing the frequent itemsets

In FIM algorithms frequent itemsets that are no longer needed by the algorithm can be written to disk, and the memory can be freed. For example, in APRIORI we need the frequent itemsets of size k only for the candidate generation of the (k + 1)-itemsets, and later they are not used. Consequently the memory need can be reduced if, during candidate generation, the frequent itemsets are written out to disk and the branches of the trie that do not lead to any candidate are pruned.

In most applications (for example to generate association rules) frequent itemsets are mined in order to be used. In such applications it is useful if the existence and the support
of the frequent itemsets can be quickly determined. Again, a trie is a suitable data structure for storing frequent itemsets, since frequent itemsets often share prefixes.

Regarding memory need, it is a wasteful solution to store frequent itemsets and candidates in different tries. To illustrate this, examine two tries, one storing only the candidates and one storing only the frequent itemsets. The two tries can easily be merged into a single trie, which is shown in Figure 4.

Figure 4. Example: Tries that store candidates and frequent itemsets

The memory need of the two separate tries (10 + 11 nodes) is more than the memory need of the merged trie (15 nodes), which stores the candidates and the frequent itemsets as well. The problem with a merged trie is that support counting becomes slower. This is due to superfluous travel, i.e. travel on routes that do not lead to candidates. For example, if the transaction contains items C, D, then we will follow the edges that start from the root and have labels C, D. This is obviously useless work, since these edges do not lead to nodes at depth 3, where the candidates can be found.

To avoid this superfluous traveling, at every node we can store the length of the longest directed path that starts from that node [5]. When searching for k-itemset candidates at depth d, we move downward only if the maximum path length at the pointed node is at least k − d + 1. Storing the maximum path lengths requires memory, but it considerably reduces the search time for large itemsets.

A better solution is to distinguish two kinds of edges. If an edge is on the way to a candidate, then it is a dashed edge; any other edge is a normal edge. This solution is shown in Figure 4. Dashed and normal edges belonging to the same node are stored in different lists. During a support count only the dashed edges are taken into consideration. This way many edges (the normal ones) are ignored even if their label corresponds to some element of the transaction. With this solution pointer increments are reduced (the list of dashed edges is shorter than the two lists together) and we do not need to check whether an edge leads to a candidate (a comparison with an addition is spared).
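A node in such a merged trie could separate the two kinds of edges roughly as follows; this is only our sketch of the idea, with invented field names, not the layout used in the actual implementation.

#include <cstdint>
#include <utility>
#include <vector>

using Item = std::uint32_t;

// Merged trie node storing candidates and frequent itemsets together.
// Edges that lie on a path to a candidate ("dashed" edges in the text) are
// kept apart from the remaining ("normal") edges, so the support count only
// has to scan the dashed list, while lookups of frequent itemsets may use both.
struct MergedTrieNode {
    std::uint32_t counter = 0;
    std::vector<std::pair<Item, MergedTrieNode*>> dashedEdges;  // lead to a candidate
    std::vector<std::pair<Item, MergedTrieNode*>> normalEdges;  // all other edges
};
// During support counting, routing (Section 4.3) is applied to dashedEdges only;
// normalEdges are ignored even when their labels occur in the transaction.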
4.5. Deleting unimportant transactions

Let us call a filtered transaction unimportant at the k-th iteration of APRIORI if it does not contain any (k − 1)-itemset candidate. Unimportant transactions need memory and slow down the support count, since a part of the trie is visited but no leaf representing a candidate is reached. Consequently unimportant transactions should be removed, i.e. if a filtered transaction does not contain any candidate, it should be removed from memory and ignored in the later phases of APRIORI. Due to the anti-monotonic property of the support function, an ignored transaction cannot contain candidates of greater sizes. Does this idea lead to faster methods? Surprisingly, the experiments do not suggest that it does.

5. Experimental results

All tests were carried out on ten public "benchmark" databases, which can be downloaded from the FIM repository². Seven different min_supp values were used for each database. The full results would require too much space, hence only the most typical ones are shown below. All results, all programs and even the test scripts can be downloaded from http://www.cs.bme.hu/~bodon/en/fim/test.html.

² http://fimi.cs.helsinki.fi/data/

Tests were run on a PC with a 1.8 GHz Intel P4 processor and 1 Gbytes of RAM. The operating system was Debian Linux (kernel version: 2.4.24). Run-times and memory
usage were obtained using the time and memusage commands, respectively. In the tables, time is given in seconds and memory need in Mbytes. min_freq denotes the frequency threshold, i.e. min_supp divided by the number of transactions.

5.1. Storing the transactions

First we compared three data structures used for storing filtered transactions. The memory need and construction time of the commonly used trie were compared to those of a sorted list and a red-black tree (denoted by RB-tree). The RB-tree (or symmetric binary B-tree) is a self-balancing binary search tree with a useful characteristic: inserting an element needs O(log m) time, where m is the number of nodes (the number of already inserted filtered transactions in our case).

Table 1. Memory need (Mbytes): storing filtered transactions. Database: T40I10D100K

min freq   sorted list   trie   RB-tree
0.05       12.4          61.1   13.8
0.02       16.2          88.5   17.1
0.0073     17.0          94.9   18.0
0.006      17.1          95.3   18.1

All 70 tests support the same observation: single lists need the least memory, RB-trees need a bit more, and tries consume the most memory – up to 5-6 times more than RB-trees.

Figure 5 (database: kosarak) shows a typical result on the construction and destruction time of the different data structures. Based on the 70 measurements we can conclude that there is no significant difference if the database is small (i.e. the number of filtered transactions is small); however – as expected – sorted lists do slow down as the database grows. RB-trees are always faster than tries, but the difference is not significant (it was always under 30%).

Figure 5. Construction and destruction time: storing filtered transactions

In the support count of APRIORI, each filtered transaction has to be visited to determine the contained candidates. If the transactions are stored in a sorted list, then going through the elements is a very fast operation. In the case of RB-trees and tries this operation is slower, since a tree has to be traversed. However, experiments showed that there is no significant difference when compared to building the data structure, so it does not merit serious consideration.

Experiments showed that the RB-tree is the best data structure for storing filtered transactions. It needs little memory and it is the fastest with regard to construction time. But why does an RB-tree require less memory than a trie, when the trie stores the same prefixes only once and the RB-tree stores them as many times as they appear in the filtered transactions?

The answer comes from the fact that a trie has many more nodes – and therefore many more edges – than an RB-tree (except for one bit per node, RB-trees need the same amount of memory as simple binary trees). In a trie each node stores a counter and a list of edges. For each edge we have to store the label and the identifier of the node the edge points to. Thus adding a node to a trie increases the memory need by at least 5 · 4 bytes (if items and pointers are represented in 4 bytes). In a binary tree, like an RB-tree, the number of nodes equals the number of filtered transactions. Each node stores a filtered transaction and its counter.

When inserting the first k-itemset filtered transaction into a trie, k nodes are created. However, in an RB-tree we create only one node. Although the same prefixes are stored only once in a trie, this does not limit the memory increase as much. This is the reason that a binary tree needs 3-10 times less memory than a trie.

5.2. Order of items

To test the effect of ordering, we used 5 different orders: ascending and descending order by support (the first item has the smallest/highest support) and three random orders. The results with the random orders were always between the results of the ascending and descending order. First the construction times and the memory needs of FP-trees (without cross links) were examined. Experiments showed that there is little difference in the construction time whichever type of ordering is used (the difference was always less than 8%). As expected, memory need is greatly affected by the ordering. Table 2 shows typical results.
Table 2. Memory need (Mbytes): FP-tree with different orders. Database: BMS-POS

min freq (%)   1       0.2     0.09    0.035   0.02
ascending      42.48   58.03   61.34   63.6    65.04
descending     27.58   39.74   41.69   43.66   44.10
random 1       29.84   42.30   44.49   46.60   46.41
random 2       36.98   48.97   55.02   56.85   56.72
random 3       34.87   52.18   55.68   58.01   55.50

Experiments with FP-trees with different orders meet our expectations: descending order leads to the smallest memory need, while ascending order leads to the highest. This agrees with the heuristic; a trie is expected to have a small number of nodes if the order of the items corresponds to the descending order of supports.

Our experiments drew attention to the fact that the memory need of FP-trees is greatly affected by the order used. The difference between ascending and descending order can be up to tenfold. In the basic FIM problem this does not cause any trouble; we can use the descending order. However, in other FIM-related fields where the order of the items cannot be chosen freely, this side-effect has to be taken into consideration. Such fields are prefix anti-monotonic constraint based FIM algorithms, or FIM with multiple support thresholds. For example, to handle prefix anti-monotonic constraints with FP-growth, we have to use the order determined by the constraint [19]. A naive solution for handling constraints is to add post-processing to the FIM algorithm, where itemsets that do not return true on all constraints are pruned. It can be more efficient if the constraints are deeply embedded in the algorithm. In [19] it was shown that a single prefix anti-monotonic predicate can be effectively treated by FP-growth. Our experiments proved that FP-growth is very sensitive to the order of the items. Consequently, embedding a prefix anti-monotonic constraint into FP-growth does not trivially decrease the resource need. Although the search space can be reduced, we have seen that this may greatly increase the memory need and thus the traversal time.

Results on the effect of ordering in APRIORI are surprising (Figure 6). They contradict our initial claim. In almost all experiments APRIORI with the ascending order turned out to be the fastest, and the one that used descending order was the slowest. The highest difference (6 fold) was in the case of retail.dat [7].

Figure 6. APRIORI with different orders (run-time, database: BMS-POS)

These experiments support the previously stated, but so far unexplained, observation that ascending order is the best order to use in APRIORI. We have seen that a trie that uses descending order is expected to have fewer nodes than a trie that uses ascending order. However, ascending order has two advantages over descending order. First, nodes have fewer edges and hence the main step of the support count (finding the corresponding edges at a given node) is faster. The second and more important factor is that a smaller number of nodes is visited during the support count. This comes from the fact that the nodes near the root have edges labeled with rare frequent items. This means that at the beginning of the support count the most selective items are checked. Many transactions do not contain the rare frequent items, hence the support count is terminated in the earliest phase. On the contrary, when descending order is used, many edges are visited before we get to the selective items. To illustrate this fact, let us go back to Figure 2 and consider the task of determining the candidates in transaction {A, B, F, G, H}. Nodes 0, 1, 2 will be visited if descending order is used, while the search will be terminated immediately at the root in the case of the ascending order.

Concerning memory need, as expected, descending order is the best solution. However, its advantage is insignificant. APRIORI with ascending order never consumed more than 2% extra memory compared to APRIORI with descending order.

5.3. Routing strategies at the nodes

In the experiments with APRIORI using different routing strategies, we tested 5 competitors: (1) simultaneous traversal, (2-3) corresponding items were found by a binary search on the items of the transaction / on the labels of the edges, and (4-5) transactions were represented by bitvectors / vectors of indices. The results cannot be characterized by a single table. Figures 7 and 8 present results with two databases. For a more comprehensive account the reader is referred to the aforementioned test web page.

Our newly proposed routing technique (i.e. the indexvector based one) almost always outperformed the other routing strategies. Simultaneous traversal was the runner-up. The other
three alternated in performance. The bitvector approach was most often the least effective. Routing that employed binary search on the edges sometimes led to extremely bad results, but on one database (retail) it finished in first place.

Figure 7. APRIORI with different routing strategies (run-time, database: kosarak)

Figure 8. APRIORI with different routing strategies (run-time, database: BMS-POS)

Simultaneous traversal performed very well, and almost always beat the binary search approaches. The theoretical run-times do not necessarily support this. To understand why simultaneous traversal outperformed the binary search based approaches, we have to take a closer look at them.

Simultaneous traversal is a very simple method; it compares the values of the pointers and increments one or both of them. These elementary operations are very fast. In the binary search approach we increment pointers and invoke binary searches. In each binary search we make comparisons, additions and even divisions. If we take into consideration that calling a subroutine with parameters always means extra value assignments, then we can conclude that the overhead of the binary search is significant and not profitable if the list we are searching through is short. In our case neither the number of edges of the nodes nor the number of frequent items in the transaction is large. This explains the bad performance of binary search in our case.

By using a bitvector we can avoid binary search with all of its overhead. However, experiments showed that although the bitvector based approach was better than the binary search based approaches, it could not outperform the simultaneous traversal. This is because a bitvector-based approach does not take into consideration that only a part of the transaction has to be examined. Let us see an example. Assume that the only 4-itemset candidate is {D, E, F, G} and we have to find the candidates in transaction {A, B, C, D, E, F}. Except for the bitvector-based approach, none of the techniques considered will visit any node in the trie, because there is no edge of the root whose label corresponds to any of the first 6 − 4 + 1 = 3 items of the transaction. On the contrary, the bitvector-based approach uses the whole transaction and starts with a superfluous travel that goes down even to depth 3. A vector that stores the indices (the 5th competitor) overcomes this deficiency. This seems to be the reason behind its good performance (first place most of the time).

5.4. Storing the frequent itemsets

Figure 9 shows typical run-times of three different variants of APRIORI. In the first and second, frequent itemsets were stored in memory; in the third they were written to disk. To avoid superfluous traversing, maximum path values were used in the first APRIORI, and different lists of edges in the second.

Figure 9. APRIORI with different frequent itemset handling techniques (run-time, database: T40I10D100K)

The memory needs of the three implementations are found in the next table.

Table 3. Memory need (Mbytes): different frequent itemset handling. Database: BMS-WebView-1

min freq (%)   maximum paths   different edgelists   written out
0.064          3.48            3.98                  2.38
0.062          7.38            8.98                  3.61
0.06           14.3            17.9                  6.0
0.058          33.5            43.5                  13.2
0.056          133.6           176.0                 47.8

As expected, we obtain the fastest APRIORI if frequent itemsets are not stored in memory but written to disk in the
candidate generation process. Experiments show that this APRIORI, however, is not significantly faster than the APRIORI that stores maximum path lengths. This comes from the fact that APRIORI spends most of the time determining the support of small and medium sized candidates. In such cases most edges lead to leaves, hence removing the other edges does not accelerate the algorithm much.

However, the memory need can be significantly reduced if frequent itemsets are not stored in memory. Experiments show that the memory need may even decrease to a third or a quarter. Consequently, if the frequent itemsets are needed to determine valid association rules and memory consumption is an important factor, then it is useful to write the frequent itemsets to disk and read them back when the association rule discovery phase starts.

5.5. Deleting unimportant transactions

Test results for deleting unimportant transactions contradict our expectations. Typical run-times are listed in Table 4.

Table 4. Run-time (seconds): deleting unimportant transactions. Database: pumsb

min freq (%)   90     81     75    71    69    67
non-delete     4.76   11.82  65.1  430   905   1301
delete         4.48   12.09  66.5  442   917   1339

In six out of ten databases, deleting unimportant transactions slowed down APRIORI. In three other databases the difference was insignificant (under 1%). The trick accelerated the algorithm only for the retail database.

Deleting unimportant transactions was expected to decrease the run-time. However, the test results showed the contrary. This is attributed to two extra cost factors that come into play as soon as we want to delete unimportant transactions.

First, we have to determine whether a given transaction contains any candidates. This means some overhead (one assignment at each elementary step of the support count) and does not save time during APRIORI in cases where the transaction contains candidates (which is the typical case).

The second kind of extra cost comes from the fact that the filtered transactions were stored in an RB-tree. For this we used the map implemented in the STL. However, deleting an entry from an RB-tree is not as cheap as deleting a leaf from a simple binary tree. The red-black property has to be maintained, which sometimes requires the expensive rotation operation. Notice that after determining the multiplicities of the filtered transactions we do not need to maintain any sophisticated data structure; only the filtered transactions and the multiplicity values are needed for the support count. Consequently, the second extra cost problem can be overcome in two ways. We can copy the filtered transactions and the counters of the RB-tree into a list, or we may let the delete operations invalidate the red-black property of the tree. Test results showed that even with these modifications, deleting unimportant transactions does not lead to a faster APRIORI.
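The first of these two workarounds – freezing the multiset of filtered transactions into a flat list once the multiplicities are known – could look roughly like this; it is a sketch built on the std::map cache from Section 4.1, not the paper's code.

#include <map>
#include <utility>
#include <vector>

using Item = int;
using FilteredTransaction = std::vector<Item>;
using TransactionCache = std::map<FilteredTransaction, unsigned>;

// Once all transactions have been read and their multiplicities determined,
// the red-black tree is no longer needed: copy its entries into a plain vector.
// Entries can later be dropped (e.g. unimportant transactions) with cheap
// swap-and-pop operations instead of costly tree deletions with rebalancing.
std::vector<std::pair<FilteredTransaction, unsigned>>
flattenCache(const TransactionCache& cache) {
    std::vector<std::pair<FilteredTransaction, unsigned>> flat;
    flat.reserve(cache.size());
    for (const auto& entry : cache) flat.push_back(entry);
    return flat;
}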
5.6. Overall performance gain

With our last experiment we would like to illustrate the overall performance gain of the prospective improvements. Two APRIORI implementations with different trie-related options are compared. In the first, ascending order and simultaneous traversal are used, and the filtered transactions are stored in an RB-tree. In the second implementation, descending order and binary search on the edges are applied, and the filtered transactions are not collected.

Table 5. Comparing the run-time (seconds) of two APRIORI implementations with different options. Database: connect

min freq (%)   90    83     75      71      67      65.5
original       213   2616   16315   34556   71265   stopped
new            3.5   9.3    66      158     365     706

The results support our claim that suitable data structure techniques lead to remarkable improvements.
5.7. Effect of programming techniques

Most papers on FIM focus on new algorithms, new candidate generation methods, support count techniques or data structure-related issues. Less attention is paid to details like the ones mentioned above, despite the fact that these tricks are able to speed up algorithms that are not regarded as being very fast. The fastest APRIORI implementation [6], which incorporates many sophisticated techniques, supports this claim by finishing among the best implementations (beating many newer algorithms) in the first FIM contest [11].

Besides the algorithmic and data-structure issues there is a third factor that quite possibly influences effectiveness. This factor is programming technique. FIM algorithms are computationally expensive and therefore no algorithm that is implemented in a high-level programming language (like Java or C#) is competitive with lower level implementations. The language C is widely used, flexible and effective, which is the reason why every competitive FIM implementation is written in C.

Unfortunately, C gives too much freedom to the implementors. Elementary operations like binary search, building different trees, list operations, memory allocation, etc. can be done in many ways. This is a disadvantage, because efficiency depends on the way these elementary operations are programmed. This may also be the reason for the performance gain of a FIM implementation. An experienced programmer is more likely to code a fast FIM algorithm than a FIM expert with less programming experience.

A tool that provides standard procedures for the elementary operations has a double advantage. The efficiency of the implementation would only depend on the algorithm itself, and the code would be more readable and maintainable because of the higher level of abstraction. Such a tool exists – it is C++ and the Standard Template Library (STL). STL provides containers (general data structures), algorithms and functions that were carefully programmed by professional programmers. By using STL the code will be easier to read and less prone to error, while maintaining efficiency (STL algorithms are asymptotically optimal). Due to these advantages, STL should be used whenever possible. It could actually become the "common language" among data mining programmers.

Besides the aforementioned advantages of STL, it also introduces some dangers. To make good use of STL's capabilities, we first need advanced knowledge about it. Our experiments on STL-related issues showed that small details and small changes can lead to high variations in run-time or memory need. The goal of this paper is to draw attention to data-structure related issues; however, we also have to mention some STL-related issues that have to be taken into careful consideration. Next, we list some of the factors we ran into that have a large impact on efficiency. These are: (1) when to use a sorted vector instead of an RB-tree (i.e. sorted vector vs. map), (2) when to store pointers instead of objects in a container (double referencing vs. copy constructor), (3) memory management of the containers and the need for the "swap trick" to avoid unnecessary memory occupation, (4) contiguous-memory containers vs. node-based containers, (5) iterators vs. using the index operator, etc. These issues are important. For example, when a vector storing pointers to objects was substituted by a vector that simply stores the objects, and the copy constructor overhead was avoided by reserving memory in advance, the run-time decreased significantly (for example to one third in the case of the T10I4D100K database). For more information on STL-related questions, the reader is referred to [22] [17].
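Two of the listed issues can be illustrated in a few lines. The type names and counts below are ours; the pattern (reserving capacity so objects are stored directly without repeated reallocation, and the shrink-to-fit "swap trick") is the general STL idiom the text refers to.

#include <cstddef>
#include <vector>

struct Candidate {
    std::vector<int> items;   // a heavyweight object, expensive to copy
};

void stlIssuesIllustrated(std::size_t expectedCount) {
    // (2) + copy-constructor overhead: store objects directly instead of
    // pointers, and reserve the expected capacity up front so that growing
    // the vector does not repeatedly reallocate and copy its elements.
    std::vector<Candidate> candidates;
    candidates.reserve(expectedCount);
    for (std::size_t i = 0; i < expectedCount; ++i)
        candidates.push_back(Candidate{{static_cast<int>(i)}});

    // (3) the "swap trick": after pruning, a vector keeps its old capacity;
    // swapping with a copy-constructed temporary releases the excess memory.
    candidates.resize(expectedCount / 2);
    std::vector<Candidate>(candidates).swap(candidates);
}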
6. APRIORI implementation submitted to FIMI'04

Based on our theoretical and experimental analysis, we implemented a fast APRIORI algorithm. Since we believe that readability is also very important, we sacrificed an insignificant amount of efficiency where it led to simpler code. Our final implementation uses a red-black tree to store the filtered transactions, the item order is ascending according to support, simultaneous traversal is used as the routing strategy, nodes representing frequent itemsets that play no role in the support count are pruned, and unimportant filtered transactions are not removed. Moreover, a single vector and an array are used to quickly find the support of the one- and two-itemset candidates [18] [5]. The implementation (version 2.4.1 at the time of writing) is fully documented and can be freely downloaded for research purposes at http://www.cs.bme.hu/~bodon/en/apriori. A version that uses a simplified output format and configuration options was submitted to the FIMI'04 contest.
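The idea behind counting 1- and 2-itemset candidates without a trie is a flat counter array. The following triangular pair-counter sketch is our own illustration of that idea, with invented names, and may differ from the data layout of the actual implementation.

#include <cstddef>
#include <vector>

using Item = int;

// Pair counters for all two-itemset candidates over n frequent items, stored
// in a flat upper-triangular array instead of a trie. Items are assumed to be
// recoded to 0 .. n-1 (see Section 4.2).
class PairCounters {
public:
    explicit PairCounters(std::size_t n)
        : n_(n), counts_(n * (n - 1) / 2, 0) {}

    // Counter of the pair {i, j} with i < j.
    void increment(Item i, Item j) { ++counts_[index(i, j)]; }
    unsigned count(Item i, Item j) const { return counts_[index(i, j)]; }

    // Called once per filtered transaction (sorted, multiplicity 'mult'):
    // every contained pair gets its counter raised by the multiplicity.
    void addTransaction(const std::vector<Item>& t, unsigned mult) {
        for (std::size_t a = 0; a + 1 < t.size(); ++a)
            for (std::size_t b = a + 1; b < t.size(); ++b)
                counts_[index(t[a], t[b])] += mult;
    }

private:
    // Row-major index of (i, j), i < j, within the upper triangle.
    std::size_t index(Item i, Item j) const {
        return static_cast<std::size_t>(i) * (2 * n_ - i - 1) / 2 + (j - i - 1);
    }
    std::size_t n_;
    std::vector<unsigned> counts_;
};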
7. Conclusion

In this paper we analyzed speed-up techniques of trie-based algorithms. Our main target was the algorithm APRIORI; however, some trie-related issues also apply to other algorithms like FP-growth. We also presented new techniques that result in a faster APRIORI.

Experiments proved that these data-structure issues greatly affect the run-time and memory consumption of the algorithms. A carefully chosen combination of these techniques can lead to a 2 to 1000 fold decrease in run-time, without significantly increasing memory consumption.
Acknowledgment

The author would like to thank Balázs Rácz for his valuable STL and programming related comments and for insightful discussions.

References

[1] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350–371, 2001.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data, pages 207–216, 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. The International Conference on Very Large Databases, pages 487–499, 1994.
[4] R. Agrawal and R. Srikant. Mining sequential patterns. In P. S. Yu and A. L. P. Chen, editors, Proceedings of the 11th International Conference on Data Engineering (ICDE), pages 3–14. IEEE Computer Society, 1995.
[5] F. Bodon. A fast apriori implementation. In B. Goethals and M. J. Zaki, editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 19 November 2003.
[6] C. Borgelt. Efficient implementations of apriori and eclat. In B. Goethals and M. J. Zaki, editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 19 November 2003.
[7] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Proceedings of the sixth International Conference on Knowledge Discovery and Data Mining, pages 254–260, 1999.
[8] D. Comer and R. Sethi. The complexity of trie index construction. Journal of the ACM, 24(3):428–440, 1977.
[9] R. de la Briandais. File searching using variable-length keys. In Western Joint Computer Conference, pages 295–298, March 1959.
[10] E. Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.
[11] B. Goethals and M. J. Zaki. Advances in frequent itemset mining implementations: Introduction to FIMI'03. In B. Goethals and M. J. Zaki, editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 19 November 2003.
[12] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12. ACM Press, 2000.
[13] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 13–23. Springer-Verlag, 2000.
[14] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proceedings of the First IEEE International Conference on Data Mining, pages 313–320, 2001.
[15] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pages 146–151. AAAI Press, August 1996.
[16] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Discovering frequent episodes in sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 210–215. AAAI Press, August 1995.
[17] S. Meyers. Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library. Addison-Wesley Longman Ltd., 2001.
[18] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proceedings of the Fourteenth International Conference on Data Engineering, pages 412–421. IEEE Computer Society, 1998.
[19] J. Pei, J. Han, and L. V. S. Lakshmanan. Mining frequent item sets with convertible constraints. In Proceedings of the 17th International Conference on Data Engineering, pages 433–442. IEEE Computer Society, 2001.
[20] T. Rotwitt, Jr. and P. A. D. de Maine. Storage optimization of tree structured files. In E. F. Codd and A. L. Dean, editors, Proceedings of the 1971 ACM-SIGFIDET Workshop on Data Description, Access and Control, pages 207–217. ACM, November 11–12, 1971.
[21] D. G. Severance. Identifier search mechanisms: A survey and generalized model. ACM Computing Surveys, 6(3):175–194, 1974.
[22] B. Stroustrup. The C++ Programming Language, Special Edition. Addison-Wesley, Boston, 2000.
[23] S. B. Yao. Tree structures construction using key densities. In Proceedings of the 1975 Annual Conference, pages 337–342. ACM Press, 1975.
[24] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80. ACM Press, 2002.
[25] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In D. Heckerman, H. Mannila, D. Pregibon, R. Uthurusamy, and M. Park, editors, Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 283–296. AAAI Press, 1997.