<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Surprising results of trie-based FIM algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ferenc Bodon</string-name>
          <email>bodon@cs.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Theory, Budapest University of Technology and Economics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The trie is a popular data structure in frequent itemset mining (FIM) algorithms. It is memory-efficient and allows fast construction and information retrieval. Many trie-related techniques can be applied in FIM algorithms to improve efficiency. In this paper we propose new techniques for fast management, but more importantly we scrutinize the well-known ones, especially those that can be employed in APRIORI. The theoretical claims are supported by the results of a comprehensive set of experiments, based on hundreds of tests performed on numerous databases with different support thresholds. We offer some surprising conclusions, which at some points contradict published claims.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Frequent itemset mining is the most researched field of
frequent pattern mining. Techniques and algorithms
developed here are often used in the search for other types of patterns
(such as sequences, rooted trees, boolean formulas, graphs).
The original problem was to discover association rules [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
where the main step was to find frequently occurring
itemsets. Over one hundred FIM algorithms were proposed –
the majority claiming to be the most efficient. The truth
is that no single most efficient algorithm exists; there is no
published method that outperforms every other method on
every dataset with every support threshold [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However,
there are three algorithms that play a central role due to their
efficiency and the fact that many algorithms are
modifications or combinations of these basic methods. These
algorithms are APRIORI [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Eclat [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and FP-growth [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Those who employed one of the basic algorithms as a
search strategy tended to adopt the accompanying set of
procedures and data structures as well, which in the case of
APRIORI and FP-growth means the trie (prefix tree). Therefore it is
useful and instructive to analyze tries, and to clarify the details
that have an effect on run-time or memory need. In this
paper we will see that small details can have a large influence
on efficiency, and that taking a closer look at them brings up new
questions and new solutions.</p>
      <p>∗Research is partially supported by the Hungarian National Science
Foundation (Grant No. OTKA TS-044733, T42706, T42481).</p>
      <p>The rest of the paper is organized as follows. The
problem is presented in Section 2; the trie and its role in FIM are
described in Section 3. Section 4 introduces accelerating
techniques one-by-one. Surprising experimental results and
the explanations are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem statement</title>
      <p>Frequent itemset mining is a special case of frequent
pattern mining. Let us first describe this general case. We
assume that the reader is familiar with the basics of poset
theory. We call a poset (P, ⪯) locally finite if every
interval [x, y] is finite, i.e. the number of elements z such that
x ⪯ z ⪯ y is finite. The element x covers y if y ≺ x and
there exists no z such that y ≺ z ≺ x.</p>
      <p>Definition 1. We call the poset PC = (P, ⪯) a pattern
context if there exists exactly one minimal element, PC is
locally finite and graded, i.e. there exists a size function
| · | : P → Z such that |p| = |p′| + 1 if p covers p′. The
elements of P are called patterns and P is called the pattern
space<sup>1</sup> or pattern set.</p>
      <p>Without loss of generality we assume that the size of the
minimal pattern is 0 and it is called the empty pattern.</p>
      <p>
        In the frequent pattern mining problem we are given the
set of input data T, the pattern context PC = (P, ⪯), the
anti-monotonic function suppT : P → N and min supp ∈
N. We have to find the set F = {p ∈ P : suppT(p) ≥
min supp} and the support of the patterns in F. Elements
of F are called frequent patterns, suppT is the support
function and min supp is referred to as the support threshold.
      </p>
      <p>
        <sup>1</sup>Many researchers improperly refer to the pattern space as the pattern
lattice. This may come from the fact that patterns were first examined as sets
of items [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], when (P, ⪯) actually formed a lattice. However this property
does not hold for many other types of patterns. It is easy to prove that if the
type of pattern is sequence, boolean formula or graph, then the least upper
bound is not unique.
      </p>
      <p>
        There are many types of patterns: itemsets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], item
sequences, sequences of itemsets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], episodes [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], boolean
formulas [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], rooted labeled ordered/unordered trees [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
labeled induced subgraphs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], labeled subgraphs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In
frequent itemset mining [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the pattern context is (2<sup>I</sup>, ⊆),
where I is a given set, ⊆ is the usual subset relation, and the
input data is a sequence of transactions (T = t1, . . . , tm ).
The elements of I are called items and each transaction is
a set of items. The support of itemset I is the number of
transactions that contain I as a subset.
      </p>
      <p>There exist many algorithms which efficiently solve the
FIM problem. Most of them are APRIORI and FP-growth
based where efficiency comes from the sophisticated use of
the trie data structure. The goal of our work is to
scrutinize the use of trie theoretically and experimentally. Our
claims are supported by hundreds of experiments based on
many different databases with various support thresholds.
We believe that our work clarifies
important technical details of APRIORI, and takes some steps
towards finding the best implementation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Tries</title>
      <p>
        The data structure trie was originally introduced by de
la Briandais [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Fredkin [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to store and efficiently
retrieve words of a dictionary. A trie is a rooted, labeled
tree. The root is defined to be at depth 0, and a node at
depth d can point to nodes at depth d + 1. A pointer is also
referred to as edge or link. If node u points to node v, then
we call u the parent of v, and v is a child node of u. For
the sake of efficiency – concerning insertion and deletion –
a total order on the labels of edges has to be defined.
      </p>
      <p>Tries are suitable for storing and retrieving not only
words, but any finite sets or sequences. In FIM algorithms
tries (also called lexicographic trees) are used to quickly
determine the support of itemsets whose size is greater than
2. In the FIM setting a link is labeled by a frequent item, and
a node represents an itemset: the set of the items on
the path from the root to that node. Each node stores
the counter of the itemset that it represents.</p>
      <p>Figure 1 presents a trie (without the counters) that stores
the itemsets {A}, {C}, {E}, {F}, {A,C}, {A,E}, {A,F},
{E,F}, {A,E,F}. Building a trie is straightforward; we omit
the details.</p>
      <p>[Figure 1: the trie storing the itemsets listed above; edge labels A, C, E, F.]</p>
      <p>Tries can be implemented in many ways. In the compact
representation the edges of a node are stored in a vector.
Each element of the vector is a pair; the first element stores
the label of the edge, the second stores the address of the
node that the edge points to. This solution is very similar
to the widespread “doubly chained” representation, where the
edges of a node are stored in a linked list.</p>
      <p>
        In the non compact representation (also called tabular
implementation by Fredkin) only the pointers are stored in
a vector with a length equal to that of the alphabet
(frequent items in our case). An element at index i belongs
to the edge whose label is the ith item. If there is no edge
with such a label, then the element is NIL. This solution
has the advantage of finding an edge with a given label in
O(1) time, instead of O(log n) required by a binary search –
which is the case in the compact representation. Unfortunately,
for nodes with few edges this representation requires
more memory than the compact representation. On the
contrary, if a node has many edges (an exact formula can be given
based on the memory needs of pointers and labels), then the
non-compact representation needs less memory, since labels
are not stored explicitly. Depending on the memory need, the
two approaches can be combined [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. If there are
many nodes with a single edge, then further memory can be
saved by using Patricia tries. Throughout this paper and in
our implementations the compact representation of tries is used.
      </p>
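<p>As an illustration of the compact representation described above, here is a minimal sketch (in Python for brevity; it is not the implementation used in the experiments). Each node keeps a sorted vector of edge labels parallel to a vector of child pointers, so an edge with a given label is found by binary search in O(log n):</p>

```python
import bisect

class TrieNode:
    """Compact-representation trie node: a counter plus sorted,
    parallel vectors of edge labels and child pointers."""
    __slots__ = ("counter", "labels", "children")

    def __init__(self):
        self.counter = 0
        self.labels = []    # sorted edge labels (items)
        self.children = []  # children[i] is reached via labels[i]

    def child(self, item):
        # Binary search among the edge labels.
        i = bisect.bisect_left(self.labels, item)
        if i < len(self.labels) and self.labels[i] == item:
            return self.children[i]
        return None

def insert(root, itemset):
    """Insert an itemset into the trie, creating nodes as needed."""
    node = root
    for item in sorted(itemset):
        nxt = node.child(item)
        if nxt is None:
            nxt = TrieNode()
            i = bisect.bisect_left(node.labels, item)
            node.labels.insert(i, item)
            node.children.insert(i, nxt)
        node = nxt
    return node  # the node representing the itemset

root = TrieNode()
for s in [{'A'}, {'C'}, {'E'}, {'F'}, {'A', 'C'}, {'A', 'E'},
          {'A', 'F'}, {'E', 'F'}, {'A', 'E', 'F'}]:
    insert(root, s)
```

<p>After these insertions the root has the four edges A, C, E, F of Figure 1, and the node of itemset {A, E} has a single edge labeled F.</p>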
      <p>Tries are used in FIM algorithms in two ways. In
APRIORI based algorithms the tries store candidates (itemsets
whose support has to be determined), and in APRIORI and
FP-growth based algorithms the input sequence (more
precisely a projection of the input) is stored in a trie.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Techniques for fast management</title>
      <p>In this section we take trie issues one-by-one, describe
the problem and present previous claims or naive
expectations. Some issues apply to APRIORI and FP-growth based
algorithms, but some apply only to the first algorithm
family.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Storing the transactions</title>
      <p>Let us call the itemset that is obtained by removing
infrequent items from t the filtered transaction of t. All frequent
itemsets can be determined even if only filtered transactions
are available. To reduce IO cost and speed up the algorithm,
the filtered transactions can be stored in main memory
instead of on disk. It is wasteful to store the same filtered
transaction multiple times. Instead, store it once and employ
a counter which stores its multiplicity. This way memory
is saved and run-time can be significantly improved.</p>
      <p>This fact is used in FP-growth and can be used in
APRIORI as well. In FP-growth the filtered transactions are
stored in an FP-tree, which is a trie with cross-links and
a header table. The size of the FP-tree that stores the filtered
transactions is declared to be “substantially smaller than the size
of database”. This is said to come from the fact that a trie
stores the same prefixes only once.</p>
      <p>
        In the case of APRIORI, collecting filtered transactions
has a significant influence on run-time. This is due to the
fact that finding candidates that occur in a given transaction
is a slow operation and the number of these procedure calls
is considerably reduced. If a filtered transaction occurs n
times, then the expensive procedure will be called just once
(with counter increment n) instead of n times (with counter
increment 1). A trie can be used to store the filtered
transactions, and is actually used in today’s fastest APRIORI
implementation, made by Christian Borgelt [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
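<p>The multiplicity-counting idea itself can be sketched independently of the data structure used to collect the transactions (illustrative Python; the names are ours, not Borgelt's):</p>

```python
from collections import Counter

def filtered_transactions(transactions, frequent_items):
    """Remove infrequent items from each transaction and count how many
    times each distinct filtered transaction occurs."""
    counts = Counter()
    for t in transactions:
        ft = tuple(sorted(set(t) & frequent_items))
        if ft:  # empty filtered transactions carry no information
            counts[ft] += 1
    return counts

txns = [{'A', 'B'}, {'A', 'B', 'X'}, {'A'}, {'B', 'A'}]
freq = {'A', 'B'}
counts = filtered_transactions(txns, freq)
```

<p>Here the filtered transaction ('A', 'B') occurs three times, so the expensive candidate-matching procedure would run once with counter increment 3 instead of three times with increment 1.</p>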
      <p>Is the trie really the best solution for collecting filtered
transactions for APRIORI, or does a better solution exist? See
the experimental results and explanation for the surprising
answer.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2. Order of items</title>
      <p>
        For quick insertion and to quickly decide if an itemset
is stored in a trie, it is useful to store edges ordered
according to their labels (i.e. items in our case) and apply
a binary search. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] it was first noted that the order
of the items affects the shape of the trie. The next figure
shows an example of two tries that store the same itemsets
(ABC, ABD, ACE) but use different orders (A &gt; B &gt;
C &gt; D &gt; E and its reverse).
      </p>
      <p>
        To reduce the memory need and the traversal
time, it would be useful to use the ordering that results in the
minimal trie. Comer and Sethi proved in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that the
minimal trie problem is NP-complete. On the other hand, a
simple heuristic (which is employed in FP-growth) performs
very well in practice: use the descending order according
to the frequencies. This is inspired by the fact that tries
store the same prefixes only once, and there is a higher chance
of itemsets sharing prefixes if frequent items have
small indices.
      </p>
      <p>[Figure 2: the two tries storing ABC, ABD, ACE under the two orders, with numbered nodes.]</p>
      <p>The heuristic does not always result in the minimal trie,
as the following example proves. Let us store the
itemsets AX, AY, BXK, BXL, BM, BN in a trie. A
trie that uses the descending order according to frequencies
(B ≺ X ≺ A ≺ K ≺ L ≺ M ≺ N) is depicted on
the left side of Figure 3. On the right side we can see a trie
that uses another order (items X and A are reversed) and
has fewer nodes.</p>
      <p>[Figure 3: left, the trie built with the descending frequency order; right, the trie with items X and A reversed, which has fewer nodes.]</p>
      <p>Descending order may fail to produce the smallest trie
when the trie has subtrees in which the items' frequency order
within the subtree does not correspond to their
frequency order in the whole
trie. This seldom occurs in real-life databases (a kind of
homogeneity exists), which explains the success of
the heuristic.</p>
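<p>The node counts in such examples can be verified mechanically: the nodes of a trie correspond to the distinct prefixes of the reordered itemsets, plus the root. A small illustrative sketch (the helper name is ours):</p>

```python
def trie_size(itemsets, order):
    """Number of trie nodes (including the root) needed to store the
    itemsets when items are sorted according to `order`."""
    rank = {item: i for i, item in enumerate(order)}
    prefixes = set()
    for s in itemsets:
        seq = tuple(sorted(s, key=rank.__getitem__))
        for i in range(1, len(seq) + 1):
            prefixes.add(seq[:i])  # each distinct prefix is one node
    return len(prefixes) + 1       # +1 for the root

itemsets = ["AX", "AY", "BXK", "BXL", "BM", "BN"]
descending = "BXAKLMNY"   # descending frequency, as in the text
reversed_xa = "BAXKLMNY"  # the same order with X and A swapped
```

<p>Counting the prefixes for the two orders reproduces the node counts of the counter-example: the swapped order yields a strictly smaller trie.</p>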
      <p>
        Can this heuristic be applied in APRIORI as well? Fewer
nodes require less memory, also fewer nodes need to be
visited during the support count (fewer recursive steps), which
would suggest faster operation. However, previous
observations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] claim the opposite. Here we conduct
comprehensive experiments and try to find reasons for the
contradiction.
      </p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Routing strategies at the nodes</title>
      <p>Routing strategy in a trie refers to the method used at
an inner node to select the edge to follow. To find out if
a single itemset I is contained in a trie, the best routing
strategy is to perform a binary search: at depth d − 1 we
have to find the edge whose label is the same as the dth item
of I. The best routing strategy is not so straightforward if
we have to find all the k-itemset subsets of a given itemset
that are contained in a given trie. This is the main step of
support count in APRIORI and this is the step that primarily
determines the run-time of the algorithm. In this section we
examine the possible solutions and propose a novel solution
that is expected to be more efficient.</p>
      <p>In support count methods we have to find all the leaves
that represent k-itemset candidates contained in a
given transaction t. Let us assume that we arrive at a node at
depth d by following the jth item of the transaction. We
move forward on links whose labels are items i ∈ t with an index
greater than j but less than |t| − k + d + 1, because we
need at least k − d − 1 further items to reach a leaf that represents
a candidate. Put simply, given a part of the transaction (t′),
we have to find the edges that correspond to an item in t′.
The items in the transaction and the labels of the edges are ordered.
The number of edges of the node is denoted by n. Two
elementary solutions are applicable:
simultaneous traversal: two pointers are maintained; one
is initialized to the first element of t′, and the other to
the label of the first edge. The pointer that
points to the smaller item is advanced. If the pointed
items are the same, then a match is found, and both
pointers are advanced. The worst-case run-time is O(n +
|t′|).
binary search: This includes two basic approaches and the
combination of both: for each item in t′ we find the
corresponding edge (if there is any), or for each edge
the corresponding item of t′. The run-times of the two
approaches are O(|t′| log n) and O(n log |t′|),
respectively. Since the lists are ordered, it is not necessary to
perform a binary search on the whole list if a match is
found. For example, if the first item in t′ corresponds
to the label of the fifth edge, then for the second
item of t′ we have to check labels starting from the
sixth edge.</p>
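<p>The simultaneous traversal is an ordinary merge of two ordered lists; a sketch (illustrative Python, with hypothetical names):</p>

```python
def matching_edges(edge_labels, t_part):
    """Return the edge labels that occur in the transaction part.
    Both inputs are sorted; two pointers are advanced in tandem,
    giving O(n + |t'|) worst-case time."""
    matches = []
    i = j = 0
    while i < len(edge_labels) and j < len(t_part):
        if edge_labels[i] < t_part[j]:
            i += 1                      # edge label too small, advance
        elif edge_labels[i] > t_part[j]:
            j += 1                      # transaction item too small, advance
        else:
            matches.append(edge_labels[i])  # match found, advance both
            i += 1
            j += 1
    return matches
```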
      <p>From the run-times we can see that if the size of t′
is small and there are many edges, then the first kind
of binary search is faster, and in the opposite case the
second kind is better. We can combine the two
solutions: if |t′| log n &lt; n log |t′|, then we perform the
binary search on the edges – otherwise the binary search
is applied on t′.
bitvector based: As mentioned before, the binary search can
be avoided if a non-compact representation is used.
However, this increases the memory need. A much
better solution is to change the representation of the
filtered transaction rather than that of the nodes of the trie. We
can use a bitvector instead of an ordered list. The
element at index i is 1 if item i is contained in the transaction,
and the length of the bitvector is the number of
frequent items. A bitvector needs more space if the size
of the filtered transaction is small, which is the general
case in most applications. Hence it is useful to store
the filtered transactions as lists and convert them into
bitvectors only when the contained candidates have to be determined.</p>
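<p>A sketch of the bitvector-based strategy (illustrative Python; item labels are assumed to be the integer indices of the frequent items):</p>

```python
def to_bitvector(t, num_frequent_items):
    """Convert a filtered transaction (list of item indices) into a
    bitvector indexed by frequent item."""
    bv = [0] * num_frequent_items
    for item in t:
        bv[item] = 1
    return bv

def edges_in_bitvector(edge_labels, bv):
    # Each edge label is tested in O(1), so a node with n edges
    # is processed in O(n) time.
    return [i for i in edge_labels if bv[i]]
```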
      <p>
        The run-time of this solution is O(n).
indexvector based: The problem with bitvectors is that
they do not exploit the fact that at a certain depth only
a part of the transaction needs to be examined. For
example, if the item of the first edge is the same as the last
item of the basket, then the other edges need not be
examined. The bitvector-based approach does not take
into consideration the positions of the items in the basket.
We can easily overcome this problem if the indices of
the items are stored in the vector. For example,
transaction {B, D, G} is stored as [0, 1, 0, 2, 0, 0, 3, 0, 0, 0]
if the number of frequent items is 10. The routing
strategy with this vector is the following. We take the
labels of the edges one-by-one. If item i is the actual label,
we check element i of the vector. If it is 0, the
item is not contained. If the element is smaller than
|t| − k + d + 1, then a match is found (and the support
count proceeds with the next edge). Otherwise the
procedure is terminated. The worst-case run-time of
this solution is O(n).
      </p>
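<p>A sketch of the index-vector strategy (illustrative Python; it assumes, as in the example above, that B, D, G are the frequent items with indices 1, 3 and 6, and that bound stands for |t| − k + d + 1):</p>

```python
def to_indexvector(t, num_frequent_items):
    """Element i holds the 1-based position of item i within the
    (sorted) transaction, or 0 if the item is absent."""
    iv = [0] * num_frequent_items
    for pos, item in enumerate(sorted(t), start=1):
        iv[item] = pos
    return iv

def edges_in_indexvector(edge_labels, iv, bound):
    """Scan the (ordered) edge labels; stop as soon as a matched item's
    position leaves too few remaining items to reach a candidate."""
    matches = []
    for i in edge_labels:
        pos = iv[i]
        if pos == 0:
            continue          # item not contained in the transaction
        if pos < bound:
            matches.append(i) # match; support count proceeds here
        else:
            break             # labels are ordered, positions only grow
    return matches
```

<p>With {B, D, G} encoded as [0, 1, 0, 2, 0, 0, 3, 0, 0, 0], the scan terminates at item G when the bound is 3, which is exactly the early-termination advantage over the plain bitvector.</p>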
      <p>We have presented six different routing strategies that
do not change the structure of the trie. Theoretically, no
single best strategy can be declared. However, our experiments
have shown that the last proposed solution always
outperforms the others.</p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Storing the frequent itemsets</title>
      <p>In FIM algorithms frequent itemsets that are not needed
by the algorithm can be written to disk, and their memory
can be freed. For example, in APRIORI we need the frequent
itemsets of size ℓ only for the candidate generation of the (ℓ +
1)-itemsets, and later they are not used. Consequently memory
need can be reduced if, during candidate generation, the
frequent itemsets are written out to disk and branches of the
trie that do not lead to any candidate are pruned.</p>
      <p>In most applications (for example to generate association
rules) frequent itemsets are mined in order to be used. In
such applications it is useful if the existence and the support
of the frequent itemsets can be quickly determined. Again, a
trie is a suitable data structure for storing frequent itemsets,
since frequent itemsets often share prefixes.</p>
      <p>Regarding memory need, it is wasteful to store
frequent itemsets and candidates in separate tries. To
illustrate this, examine the two tries of Figure 4, which store
the frequent itemsets and the candidates, respectively. They can
easily be merged into one trie, which is also shown in Figure 4.</p>
      <p>[Figure 4: the frequent-itemset trie, the candidate trie, and the merged trie with dashed and normal edges; edge labels A, B, C, D.]</p>
      <p>The memory need of the two tries (10 + 11 nodes) is
more than the memory need of the merged trie (15 nodes),
which stores candidates and frequent itemsets as well. The
problem with a merged trie is that support counting
becomes slower. This is due to superfluous travel, i.e. travel
on routes that do not lead to candidates. For example, if
a transaction contains items C and D, then we will follow the
edges that start from the root and have labels C and D. This
is obviously useless work, since these edges do not lead to
nodes at depth 3, where the candidates can be found.</p>
      <p>
        To avoid this superfluous traveling, at every node we
store the length of the longest directed path that starts from
that node [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. When searching for ℓ-itemset candidates at
depth d, we move downward only if the maximum path
length at the pointed node is at least ℓ − d + 1. Storing maximum
path lengths requires memory, but it considerably reduces
the search time for large itemsets.
      </p>
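<p>The maximum path lengths can be computed by a single depth-first traversal; a sketch (illustrative Python over a nested-dict trie, not the paper's node structure):</p>

```python
def max_path_lengths(trie):
    """For every node of a nested-dict trie, compute the length (in
    edges) of the longest downward path starting at that node."""
    lengths = {}

    def dfs(node, path):
        best = 0
        for item, child in node.items():
            best = max(best, 1 + dfs(child, path + (item,)))
        lengths[path] = best  # key: the item path identifying the node
        return best

    dfs(trie, ())
    return lengths

trie = {'A': {'B': {'C': {}}}, 'D': {}}
lengths = max_path_lengths(trie)
```

<p>During support count, a child whose stored length is below the required remaining depth is simply not descended into.</p>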
      <p>A better solution is to distinguish two kinds of edges. If
an edge is on the way to a candidate, then it is a dashed
edge; all other edges are normal edges. This solution
is shown in Figure 4. Dashed and normal edges belonging
to the same node are stored in different lists. During a
support count only dashed edges are taken into consideration.
This way many edges (the normal ones) are ignored even if
their label corresponds to some element of the transaction.
With this solution pointer increments are reduced (the list
with dashed edges is shorter than the two lists together) and
we do not need to check if the edge leads to a candidate (a
comparison with an addition is spared).</p>
    </sec>
    <sec id="sec-11">
      <title>4.5. Deleting unimportant transactions</title>
      <p>Let us call a filtered transaction unimportant at the ℓth
iteration of APRIORI if it does not contain any (ℓ −
1)-itemset candidates. Unimportant transactions need
memory and slow down support count, since a part of the trie
is visited but no leaf representing a candidate is reached.
Consequently unimportant transactions should be removed,
i.e. if a filtered transaction does not contain any candidate,
it should be removed from memory and ignored in the
later phases of APRIORI. Due to the anti-monotonic
property of the support function, an ignored transaction cannot
contain candidates of greater sizes. Does this idea lead to
faster methods? Surprisingly, experiments do not suggest
that it does.</p>
    </sec>
    <sec id="sec-12">
      <title>5. Experimental results</title>
      <p>All tests were carried out on ten public “benchmark”
databases, which can be downloaded from the FIM
repository<sup>2</sup>. Seven different min supp values were used for
each database. The full results would require too much space,
hence only the most typical ones are shown below. All
results, all programs and even the test scripts can be
downloaded from http://www.cs.bme.hu/~bodon/en/
fim/test.html.</p>
      <p>Tests were run on a PC with a 1.8 GHz Intel P4
processor and 1 Gbyte of RAM. The operating system was
Debian Linux (kernel version: 2.4.24). Run-times and memory
usage were obtained using the time and memusage
commands, respectively. In the tables, time is given in seconds
and memory need in Mbytes. min freq denotes the
frequency threshold, i.e. min supp divided by the number of
transactions.</p>
      <p><sup>2</sup>http://fimi.cs.helsinki.fi/data/</p>
    </sec>
    <sec id="sec-13">
      <title>5.1. Storing the transactions</title>
      <p>First we compared three data structures used for storing
filtered transactions. The memory need and construction
time of the commonly used trie was compared to a sorted
list and a red-black tree (denoted by RB-tree). RB-tree (or
symmetric binary B-tree) is a self-balanced binary search
tree with a useful characteristic: inserting an element takes
O(log m) time, where m is the number of nodes (the number of
already inserted filtered transactions in our case).</p>
      <p>All 70 tests support the same observation: sorted lists
need the least memory, RB-trees need a bit more, and tries
consume the most memory – up to 5-6 times more than
RB-trees.</p>
      <p>The next figure shows a typical result on the construction
and destruction time of the different data structures. Based
on the 70 measurements we can conclude that there is no
significant difference if the database is small (i.e. the
number of filtered transactions is small), however – as expected
– sorted lists do slow down as the database grows. RB-trees
are always faster than tries, but the difference is not
significant (it was always under 30%).</p>
      <p>In support count of APRIORI, each filtered transaction
has to be visited to determine the contained candidates. If
transactions are stored in a sorted list, then going through
the elements is a very fast operation. In the case of
RB-trees and tries, this operation is slower, since a tree has to
be traversed. However, experiments showed that the difference is not
significant compared to the cost of building the data
structure, so it does not merit serious consideration.</p>
      <p>Experiments showed that the RB-tree is the best data
structure for storing filtered transactions. It needs little memory
and it is the fastest with regard to construction time. But why
does an RB-tree require less memory than a trie, when the trie stores
the same prefixes only once while the RB-tree stores
them as many times as they appear in the filtered transactions?</p>
      <p>The answer to this comes from the fact that a trie has
many more nodes – therefore many more edges – than an
RB-tree (except for one bit per node, RB-trees need the
same amount of memory as simple binary trees need). In
a trie each node stores a counter and a list of edges. For
each edge we have to store the label and the identifier of
the node the edge points to. Thus adding a node to a trie
increases memory need by at least 5 · 4 bytes (if items and
pointers are represented in 4 bytes). In a binary tree, like
an RB-tree, the number of nodes equals the number of
filtered transactions. Each node stores a filtered transaction
and its counter.</p>
      <p>When inserting the first k-itemset filtered transaction in
a trie, k nodes are created. However in an RB-tree we create
only one node. Although the same prefixes are stored only
once in a trie, this does not compensate for the difference.
This is why a binary tree needs 3-10 times
less memory than a trie.</p>
    </sec>
    <sec id="sec-14">
      <title>5.2. Order of items</title>
      <p>To test the effect of ordering, we used 5 different orders:
ascending and descending order by support (first item has
the smallest/highest support) and three random orders. The
results with the random orders were always between the
results of the ascending and descending order. First the
construction times and the memory needs of FP-trees (without
cross links) were examined. Experiments showed that there
is little difference in the construction time whichever type of
ordering is used (the difference was always less than 8%).
As expected, memory need is greatly affected by the
ordering. Table 2 shows typical results.</p>
      <p>Experiments with FP-trees with different orders met our
expectations: descending order leads to the smallest
memory need, while ascending order leads to the highest. This
agrees with the heuristic; a trie is expected to have small
number of nodes if the order of the items corresponds to the
descending order of supports.</p>
      <p>
        Our experiments drew attention to the fact that the
memory need of FP-trees is greatly affected by the order
used. The difference between ascending and descending
order can be up to tenfold. In the basic FIM problem this
does not cause any trouble; we can use the descending
order. However in other FIM-related fields where the order
of the items cannot be chosen freely, this side-effect has
to be taken into consideration. Such fields are prefix
antimonotonic constraint based FIM algorithms, or FIM with
multiple support threshold. For example, to handle prefix
anti-monotonic constraints with FP-growth, we have to use
the order determined by the constraint [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. A naive
solution for handling constraints is to add post-processing to
the FIM algorithm, where itemsets that do not satisfy
all the constraints are pruned. It can be more efficient if the
constraints are deeply embedded in the algorithm. In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
it was shown that a single prefix anti-monotonic predicate
can be effectively treated by FP-growth. Our experiments
proved that FP-growth is very sensitive to the order of the
items. Consequently, embedding a prefix anti-monotonic
constraint into FP-growth does not trivially decrease
resource need. Although search space can be reduced, we
have seen that this may greatly increase memory need and
thus traversal time.
      </p>
      <p>
        Results on the effect of ordering in APRIORI are
surprising (Figure 6). They contradict our initial claim. In almost
all experiments APRIORI with the ascending order turned
out to be the fastest, and the one that used descending
ordering was the slowest. The highest difference (6-fold) occurred in
the case of retail.dat [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>[Figure 6: run-times of APRIORI with ascending and descending order at support thresholds from 0.01 down to 0.001.]</p>
      <p>These experiments support the previously stated, but
so far unexplained observation that ascending order is the
best order to use in APRIORI. We have seen that a trie that
uses descending order is expected to have fewer nodes than
a trie that uses ascending order. However, ascending order
has two advantages over descending order. First, nodes have
fewer edges, and hence the main step of support count
(finding the corresponding edges at a given node) is faster. The
second and more important factor is that a smaller number
of nodes is visited during the support count. This comes
from the fact that nodes near the root have edges labeled with rare
frequent items. This means that at the beginning
of the support count the most selective items are checked.
Many transactions do not contain rare frequent items, hence
the support count is terminated in the earliest phase. On the
contrary, when descending order is used, many edges are
visited before we get to the selective items. To illustrate
this fact, let us go back to Figure 2 and consider the task of
determining the candidates in transaction {A, B, F, G, H}.
Nodes 0, 1, 2 will be visited if descending order is used,
while the search will be terminated immediately at the root
in the case of the ascending order.</p>
      <p>Concerning memory need, as expected, descending
order is the best solution. However its advantage is
insignificant. APRIORI with ascending order never consumed more
than 2% extra memory compared to APRIORI with
descending order.</p>
    </sec>
    <sec id="sec-15">
      <title>5.3. Routing strategies at the nodes</title>
      <p>In the experiments of APRIORI with different
routing strategies, we tested 5 competitors: (1) simultaneous
traversal, (2-3) corresponding items were found by a
binary search on the items of the transaction/labels of the
edges, and (4-5) transactions were represented by
bitvectors/vectors of indices. The results cannot be characterized
by a single table. Figures 7 and 8 present results with two
databases. For a more comprehensive account the reader is
referred to the aforementioned test web page.</p>
      <p>Our newly proposed routing technique (i.e. the index-vector
based one) almost always outperformed the other routing
strategies. Simultaneous traversal was the runner-up.</p>
      <p>[Figures 7 and 8: run-times of the five routing strategies (simultaneous traversal, binary search on the transaction, binary search on the edges, vector of bits, vector of indices), plotted against the support threshold]</p>
      <p>The other three alternated in performance. The bitvector approach
was most often the least effective. Routing that employed
binary search on the edges sometimes led to extremely
bad results, but on one database (retail) it finished in first
place.</p>
      <p>Simultaneous traversal performed very well, and almost
always beat the binary search approaches. Theoretical
runtimes do not necessarily support this. To understand why
simultaneous traversal outperformed the binary search based
approaches, we have to take a closer look at them.</p>
      <p>Simultaneous traversal is a very simple method; it
compares the values of the pointers and increments one or both
of them. These elementary operations are very fast. In the
binary search approach we increment pointers and invoke
binary searches. In each binary search we make comparisons,
additions and even divisions. If we take into consideration
that calling a subroutine with parameters always means
extra value assignments, then we can conclude that the overhead
of binary search is significant and does not pay off if
the list we are searching through is short. In our case
neither the number of edges of the nodes nor the number of
frequent items in the transaction is large. This explains the
bad performance of binary search in our case.</p>
      <p>By using a binary vector we can avoid binary search
with all of its overhead. However, experiments showed
that although the bitvector based approach was better than
binary search-based approaches, it could not outperform
the simultaneous traversal. This is because a
bitvector-based approach does not take into consideration that only
a part of the transaction has to be examined. Let us see
an example. Assume that the only 4-itemset candidate is
{D, E, F, G} and we have to find the candidates in
transaction {A, B, C, D, E, F}. Except for the bitvector-based
approach, none of the techniques considered visits any node
in the trie, because no edge of the root has a
label corresponding to any of the first 6 − 4 + 1 = 3 items
of the transaction. On the contrary, the bitvector-based
approach uses the whole transaction and starts with a
superfluous traversal that descends even to depth 3. A vector that
stores the indices (the 5th competitor) overcomes this
deficiency. This seems to be the reason behind its good
performance (first place most of the time).</p>
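      <p>One plausible reading of the index-vector representation is sketched below (the names and the exact pruning rule are our assumptions, not the paper's code): each item maps to its position in the transaction, so besides an O(1) membership test the search also knows how many items remain after a match and can skip hopeless branches, which a pure bitvector cannot.</p>

```cpp
#include <cassert>
#include <vector>

// Build the index vector of a transaction t: pos[item] is the position of
// the item within t, or -1 if the item does not occur in t.
std::vector<int> toIndexVector(const std::vector<int>& t, int numItems) {
    std::vector<int> pos(numItems, -1);
    for (int i = 0; i < (int)t.size(); ++i)
        pos[t[i]] = i;
    return pos;
}

// May an edge labeled 'label' be followed, given that 'remaining' further
// items are still needed to complete the candidate? The stored position
// tells whether enough items of the transaction are left after the match.
bool mayFollow(const std::vector<int>& pos, int label,
               int transactionSize, int remaining) {
    int p = pos[label];
    return p >= 0 && transactionSize - (p + 1) >= remaining;
}
```

      <p>On the paper's example (items A..H coded 0..7, transaction {A, B, C, D, E, F}, candidate {D, E, F, G}), the rule rejects D at the root because only two items follow it, while a bitvector would report D as present and start a superfluous descent.</p>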
    </sec>
    <sec id="sec-16">
      <title>5.4. Storing the frequent itemsets</title>
      <p>Figure 9 shows typical run-times of three different
variants of APRIORI. In the first and second, frequent itemsets
were stored in memory; in the third, they were written to
disk. To avoid superfluous traversal, maximum path
values were used in the first APRIORI, and different lists of
edges in the second.</p>
      <p>Memory needs of the three implementations are found in
the next table.</p>
      <p>[Figure 9: run-times of the three APRIORI variants, plotted against the support threshold]</p>
      <p>As expected, we obtain the fastest APRIORI if frequent
itemsets are not stored in memory but written to disk during the
candidate generation process. Experiments show that this
APRIORI, however, is not significantly faster than the
APRIORI that stores maximum path values. This comes from
the fact that APRIORI spends most of the time in
determining the support of small and medium sized candidates. In
such cases most edges lead to leaves, hence removing other
edges does not accelerate the algorithm too much.</p>
      <p>However, the memory need can be significantly reduced
if frequent itemsets are not stored in memory. Experiments
show that the memory need may even decrease to a third or
a quarter. Consequently, if frequent itemsets are needed
to determine valid association rules and memory
consumption is an important factor, then it is useful to write frequent
itemsets to disk and read them back when the association
rule discovery phase starts.</p>
    </sec>
    <sec id="sec-17">
      <title>5.5. Deleting unimportant transactions</title>
      <p>Test results for deleting unimportant transactions
contradict our expectations. Typical run-times are listed in Table
4.</p>
      <p>min freq (%)   90     81     75    71   69   67
non-delete     4.76   11.82  65.1  430  905  1301
delete         4.48   12.09  66.5  442  917  1339
Database: pumsb</p>
      <p>In six out of ten databases, deleting unimportant
transactions slowed down APRIORI. In the other three databases
the difference was insignificant (under 1%). The trick
accelerated the algorithm only for the retail database.</p>
      <p>Deleting unimportant transactions was expected to
decrease run-time. However, test results showed the contrary.
This is attributed to two extra cost factors that come into play
as soon as we want to delete unimportant transactions.</p>
      <p>First, we have to determine if a given transaction
contains any candidates. This means some overhead (one
assignment at each elementary step of the support count) and
does not save time during APRIORI in cases where the
transaction contains candidates (which is the typical case).</p>
      <p>The second kind of extra cost comes from the fact that
filtered transactions were stored in an RB-tree. For this we
used the map container of the STL. However, deleting an entry
from an RB-tree is not as cheap as deleting a leaf from a
simple binary tree. The red-black property has to be
maintained which sometimes requires the expensive rotation
operation. Notice that after determining the multiplicity of the
filtered transactions we don’t need to maintain any
sophisticated data structures, only the filtered transactions and the
multiplicity values are needed for the support count.
Consequently, the second extra cost problem can be overcome
in two ways. We can copy the filtered transactions and the
counters of the RB-tree into a list or we may let the delete
operations invalidate the red-black property of the tree. Test
results showed that even with these modifications, deleting
unimportant transactions does not lead to a faster APRIORI.</p>
    </sec>
    <sec id="sec-18">
      <title>5.6. Overall performance gain</title>
      <p>With our last experiment we would like to illustrate the
overall performance gain of the proposed improvements.
Two APRIORI implementations with different trie-related
options are compared. In the first, ascending order and
simultaneous traversal are used, and filtered transactions are stored
in an RB-tree. In the second implementation, descending
order and binary search on the edges are applied, and filtered
transactions are not collected.</p>
      <p>min freq (%)   90    83     75      71      67      65.5
original (s)   213   2616   16315   34556   71265   stopped
new (s)        3.5   9.3    66      158     365     706
Database: connect</p>
      <p>The results support our claim that suitable data-structure
techniques lead to remarkable improvements.</p>
    </sec>
    <sec id="sec-19">
      <title>5.7. Effect of programming techniques</title>
      <p>
        Most papers on FIM focus on new algorithms, new
candidate generation methods, support count techniques or data
structure-related issues. Less attention is paid to details
like the ones mentioned above, despite the fact that such
tricks can significantly speed up algorithms that are not regarded
as being very fast. The fastest APRIORI implementation
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which incorporates many sophisticated techniques,
supports this claim by finishing among the best
implementations (beating many newer algorithms) in the first FIM
contest [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Besides the algorithmic and data-structure issues there is
a third factor that quite possibly influences effectiveness:
programming technique. FIM algorithms
are computationally expensive, and therefore no algorithm
that is implemented in a high-level programming language
(like Java or C#) is competitive with lower-level
implementations. The language C is widely used, flexible and
efficient, which is the reason why every competitive FIM
implementation is written in C.</p>
      <p>Unfortunately, C gives too much freedom to the
implementors. Elementary operations like binary search, building
different trees, list operations, memory-allocation, etc. can
be done in many ways. This is a disadvantage, because
efficiency depends on the way these elementary operations are
programmed. This may also account for part of the
performance gain of a FIM implementation: an experienced
programmer is more likely to code a fast FIM algorithm than a
FIM expert with less programming experience.</p>
      <p>A tool that provides standard procedures for the
elementary operations has a double advantage. Efficiency of
the implementation would depend only on the algorithm
itself, and the code would be more readable and
maintainable because of the higher level of abstraction. Such a tool
exists – it is the C++ and the Standard Template Library
(STL). STL provides containers (general data structures),
algorithms and functions that were carefully programmed
by professional programmers. By using STL the code will
be easier to read and less prone to error, while
maintaining efficiency (STL algorithms are asymptotically optimal).
Due to these advantages, STL should be used whenever
possible. Actually it could be the “common language” among
data mining programmers.</p>
      <p>
        Besides the aforementioned advantages of STL, it also
introduces some dangers. To make good use of STL’s
capabilities, we first need advanced knowledge
about them. Our experiments on STL-related issues showed
that small details and small changes can lead to large
variations in run-time or memory need. The goal of this paper is
to draw attention to data-structure related issues; however,
we also have to mention some STL-related issues that have
to be taken into careful consideration. Next, we list some of
the factors we ran into that have a large impact on efficiency.
These are: (1) when to use a sorted vector instead of an
RB-tree (i.e. sorted vector vs. map), (2) when to store
pointers instead of objects in a container (double
referencing vs. copy constructor), (3) memory management of the
containers and the need for the “swap trick” to avoid
unnecessary memory occupation, (4) contiguous-memory
containers vs. node-based containers, (5) iterators vs.
the index operator, etc. These issues are important. For
example, when a vector storing pointers of objects was
substituted by a vector that stores simply the objects and copy
constructor overhead was avoided by reserving memory in
advance, the run-time decreased significantly (for example
to one third in the case of the T10I4D100K database). For
more information on STL-related questions, the reader is
referred to [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-20">
      <title>6. APRIORI implementation submitted to FIMI'04</title>
      <p>
        Based on our theoretical and experimental analysis, we
implemented a fast APRIORI algorithm. Since we believe
that readability is also very important, we sacrificed an
insignificant amount of efficiency whenever this led to simpler code.
Our final implementation uses a red-black tree to store
filtered transactions; items are ordered ascending according to their
supports; simultaneous traversal is used as the routing
strategy; nodes representing frequent itemsets, but playing no
role in the support count, are pruned; and unimportant filtered
transactions are not removed. Moreover, a single vector
and an array are used to quickly find the supports of one-
and two-itemset candidates [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The implementation
(version 2.4.1 at the time of writing) is fully documented
and can be freely downloaded for research purposes at
http://www.cs.bme.hu/~bodon/en/apriori.
      </p>
      <p>A version that uses a simplified output format and
configuration options was submitted to the FIMI’04 contest.</p>
    </sec>
    <sec id="sec-22">
      <title>7. Conclusion</title>
      <p>In this paper we analyzed speed-up techniques of
trie-based algorithms. Our main target was the algorithm
APRIORI; however, some trie-related issues also apply to other
algorithms like FP-growth. We also presented new
techniques that result in a faster APRIORI.</p>
      <p>Experiments proved that these data-structure issues
greatly affect the run-time and memory consumption of the
algorithms. A carefully chosen combination of these
techniques can lead to a 2- to 1000-fold decrease in run-time,
without significantly increasing memory consumption.</p>
      <p>The author would like to thank Balázs Rácz for his
valuable STL- and programming-related comments and for
insightful discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. V. V.</given-names>
            <surname>Prasad</surname>
          </string-name>
          .
          <article-title>A tree projection algorithm for generation of frequent item sets</article-title>
          .
          <source>Journal of Parallel and Distributed Computing</source>
          ,
          <volume>61</volume>
          (
          <issue>3</issue>
          ):
          <fpage>350</fpage>
          -
          <lpage>371</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Swami</surname>
          </string-name>
          .
          <article-title>Mining association rules between sets of items in large databases</article-title>
          .
          <source>Proceedings of the ACM SIGMOD Conference on Management of Data</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Fast algorithms for mining association rules</article-title>
          .
          <source>The International Conference on Very Large Databases</source>
          , pages
          <fpage>487</fpage>
          -
          <lpage>499</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Mining sequential patterns</article-title>
          . In P. S. Yu and
          <string-name>
            <given-names>A. L. P.</given-names>
            <surname>Chen</surname>
          </string-name>
          , editors,
          <source>Proceedings of the 11th International Conference on Data Engineering, ICDE</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          . IEEE Computer Society,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bodon</surname>
          </string-name>
          .
          <article-title>A fast apriori implementation</article-title>
          . In B. Goethals and M. J. Zaki, editors,
          <source>Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03)</source>
          , volume
          <volume>90</volume>
          <source>of CEUR Workshop Proceedings</source>
          , Melbourne, Florida, USA, 19.
          <source>November</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Borgelt</surname>
          </string-name>
          .
          <article-title>Efficient implementations of apriori and eclat</article-title>
          . In B. Goethals and M. J. Zaki, editors,
          <source>Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03)</source>
          , volume
          <volume>90</volume>
          <source>of CEUR Workshop Proceedings</source>
          , Melbourne, Florida, USA, 19.
          <source>November</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brijs</surname>
          </string-name>
          , G. Swinnen,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vanhoof</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Wets</surname>
          </string-name>
          .
          <article-title>Using association rules for product assortment decisions: A case study</article-title>
          .
          <source>In Proceedings of the sixth International Conference on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>260</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Comer</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sethi</surname>
          </string-name>
          .
          <article-title>The complexity of trie index construction</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>24</volume>
          (
          <issue>3</issue>
          ):
          <fpage>428</fpage>
          -
          <lpage>440</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] R. de la Briandais.
          <article-title>File searching using variable-length keys</article-title>
          .
          <source>In Western Joint Computer Conference</source>
          , pages
          <fpage>295</fpage>
          -
          <lpage>298</lpage>
          ,
          <year>March 1959</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fredkin</surname>
          </string-name>
          .
          <article-title>Trie memory</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>3</volume>
          (
          <issue>9</issue>
          ):
          <fpage>490</fpage>
          -
          <lpage>499</lpage>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          .
          <article-title>Advances in frequent itemset mining implementations: Introduction to FIMI'03</article-title>
          . In B. Goethals and M. J. Zaki, editors,
          <source>Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03)</source>
          , volume
          <volume>90</volume>
          <source>of CEUR Workshop Proceedings</source>
          , Melbourne, Florida, USA, 19.
          <source>November</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation</article-title>
          .
          <source>In Proceedings of the 2000 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . ACM Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Inokuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Washio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Motoda</surname>
          </string-name>
          .
          <article-title>An apriori-based algorithm for mining frequent substructures from graph data</article-title>
          .
          <source>In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery</source>
          , pages
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          . Springer-Verlag,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuramochi</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Frequent subgraph discovery</article-title>
          .
          <source>In Proceedings of the first IEEE International Conference on Data Mining</source>
          , pages
          <fpage>313</fpage>
          -
          <lpage>320</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>Discovering generalized episodes using minimal occurrences</article-title>
          .
          <source>In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96)</source>
          , pages
          <fpage>146</fpage>
          -
          <lpage>151</lpage>
          . AAAI Press,
          <year>August 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. Inkeri</given-names>
            <surname>Verkamo</surname>
          </string-name>
          .
          <article-title>Discovering frequent episodes in sequences</article-title>
          .
          <source>In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95)</source>
          , pages
          <fpage>210</fpage>
          -
          <lpage>215</lpage>
          . AAAI Press,
          <year>August 1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Meyers</surname>
          </string-name>
          .
          <source>Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library</source>
          . Addison-Wesley Longman Ltd.,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Özden</surname>
          </string-name>
          , S. Ramaswamy, and
          <string-name>
            <given-names>A.</given-names>
            <surname>Silberschatz</surname>
          </string-name>
          .
          <article-title>Cyclic association rules</article-title>
          .
          <source>In Proceedings of the Fourteenth International Conference on Data Engineering</source>
          , pages
          <fpage>412</fpage>
          -
          <lpage>421</lpage>
          . IEEE Computer Society,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , J. Han, and
          <string-name>
            <given-names>L. V. S.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          .
          <article-title>Mining frequent item sets with convertible constraints</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Data Engineering</source>
          , pages
          <fpage>433</fpage>
          -
          <lpage>442</lpage>
          . IEEE Computer Society,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Rotwitt</surname>
          </string-name>
          , Jr. and
          <string-name>
            <surname>P. A. D. de Maine</surname>
          </string-name>
          .
          <article-title>Storage optimization of tree structured files</article-title>
          . In
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Codd</surname>
          </string-name>
          and
          <string-name>
            <surname>A. L</surname>
          </string-name>
          . Dean, editors,
          <source>Proceedings of 1971 ACM-SIGFIDET Workshop on Data Description, Access and Control</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>217</lpage>
          . ACM, November 11-12,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Severance</surname>
          </string-name>
          .
          <article-title>Identifier search mechanisms: A survey and generalized model</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ):
          <fpage>175</fpage>
          -
          <lpage>194</lpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Stroustrup</surname>
          </string-name>
          .
          <source>The C++ Programming Language, Special Edition</source>
          . Addison-Wesley, Boston,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Yao</surname>
          </string-name>
          .
          <article-title>Tree structures construction using key densities</article-title>
          . In
          <source>Proceedings of the 1975 annual conference</source>
          , pages
          <fpage>337</fpage>
          -
          <lpage>342</lpage>
          . ACM Press,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          .
          <article-title>Efficiently mining frequent trees in a forest</article-title>
          . In
          <source>Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          . ACM Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogihara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>New algorithms for fast discovery of association rules</article-title>
          . In
          <string-name>
            <given-names>D.</given-names>
            <surname>Heckerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pregibon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Uthurusamy</surname>
          </string-name>
          , editors,
          <source>Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>283</fpage>
          -
          <lpage>296</lpage>
          . AAAI Press, Menlo Park, CA, 12-15
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>