<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A fast APRIORI implementation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ferenc Bodon</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences H-1111 Budapest</institution>
          ,
          <addr-line>Lágymányosi u. 11</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The efficiency of frequent itemset mining algorithms is determined mainly by three factors: the way candidates are generated, the data structure that is used and the implementation details. Most papers focus on the first factor, some describe the underlying data structures, but implementation details are almost always neglected. In this paper we show that the effect of implementation can be more important than the selection of the algorithm. Ideas that seem to be quite promising, may turn out to be ineffective if we descend to the implementation level. We theoretically and experimentally analyze APRIORI which is the most established algorithm for frequent itemset mining. Several implementations of the algorithm have been put forward in the last decade. Although they are implementations of the very same algorithm, they display large differences in running time and memory need. In this paper we describe an implementation of APRIORI that outperforms all implementations known to us. We analyze, theoretically and experimentally, the principal data structure of our solution. This data structure is the main factor in the efficiency of our implementation. Moreover, we present a simple modification of APRIORI that appears to be faster than the original algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Finding frequent itemsets is one of the most investigated
fields of data mining. The problem was first presented in
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The subsequent paper [3] is considered as one of the
most important contributions to the subject. Its main
algorithm, APRIORI, not only influenced the association rule
mining community, but it affected other data mining fields
as well.
      </p>
      <p>
Association rule and frequent itemset mining became a
widely researched area, and hence faster and faster
algorithms have been presented. (Research supported in part by OTKA
grants T42706, T42481 and the EU-COE Grant of MTA SZTAKI.)
Many of them are APRIORI-based algorithms or APRIORI
modifications. Those who adopted APRIORI as a basic search
strategy tended to adopt the whole set of procedures and data
structures as well [
        <xref ref-type="bibr" rid="ref21">20</xref>
        ][
        <xref ref-type="bibr" rid="ref9">8</xref>
        ][
        <xref ref-type="bibr" rid="ref22">21</xref>
        ][
        <xref ref-type="bibr" rid="ref27">26</xref>
        ]. Since the scheme of this
important algorithm was not only used in basic association rules
mining, but also in other data mining fields
(hierarchical association rules [
        <xref ref-type="bibr" rid="ref23">22</xref>
        ][
        <xref ref-type="bibr" rid="ref17">16</xref>
        ][
        <xref ref-type="bibr" rid="ref12">11</xref>
        ], association rules
maintenance [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ][
        <xref ref-type="bibr" rid="ref11">10</xref>
        ][
        <xref ref-type="bibr" rid="ref25">24</xref>
        ][
        <xref ref-type="bibr" rid="ref6">5</xref>
        ], sequential pattern mining [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ][
        <xref ref-type="bibr" rid="ref24">23</xref>
        ],
episode mining [
        <xref ref-type="bibr" rid="ref19">18</xref>
        ] and functional dependency discovery
[
        <xref ref-type="bibr" rid="ref15 ref3">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref16">15</xref>
        ]), it seems appropriate to critically examine the
algorithm and clarify its implementation details.
      </p>
      <p>
A central data structure of the algorithm is the trie or the
hash-tree. Concerning speed, memory need and sensitivity to
parameters, tries were proven to outperform hash-trees [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ].
In this paper we will show a version of the trie that gives the
best result in frequent itemset mining. In addition to the
description, theoretical and experimental analysis, we provide
implementation details as well.
      </p>
      <p>The rest of the paper is organized as follows. In Section
2 the problem is presented, in Section 3 tries are described
and it is shown how the original trie can be modified to
obtain a much faster algorithm. Implementation is detailed
in Section 4. Experimental results are given in Section 5. In
Section 6 we mention further improvement possibilities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Statement</title>
      <p>Frequent itemset mining came from efforts to discover
useful patterns in customers’ transaction databases. A
customers’ transaction database is a sequence of transactions
(T = t1, . . . , tn ), where each transaction is an itemset
(ti ⊆ I). An itemset with k elements is called a k-itemset.
In the rest of the paper we make the (realistic) assumption
that the items are from an ordered set, and transactions are
stored as sorted itemsets. The support of an itemset X in T,
denoted as suppT(X ), is the number of those transactions
that contain X , i.e. suppT(X ) = |{tj : X ⊆ tj }|. An
itemset is frequent if its support is greater than a support
threshold, originally denoted by min supp. The frequent
itemset mining problem is to find all frequent itemsets in a
given transaction database.</p>
      <p>
The first, and maybe the most important, solution for
finding frequent itemsets is the APRIORI algorithm [3].
Later, faster and more sophisticated algorithms were
suggested, most of them being modifications of APRIORI
[
        <xref ref-type="bibr" rid="ref21">20</xref>
        ][
        <xref ref-type="bibr" rid="ref9">8</xref>
        ][
        <xref ref-type="bibr" rid="ref22">21</xref>
        ][
        <xref ref-type="bibr" rid="ref27">26</xref>
        ]. Therefore if we improve the APRIORI
algorithm then we improve a whole family of algorithms. We
assume that the reader is familiar with APRIORI [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and we
turn our attention to its central data structure.
      </p>
      <p>3. Determining Support with a Trie</p>
      <p>
        The data structure trie was originally introduced to store
and efficiently retrieve words of a dictionary (see for
example [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ]). A trie is a rooted, (downward) directed tree like a
hash-tree. The root is defined to be at depth 0, and a node
at depth d can point to nodes at depth d + 1. A pointer is
also called an edge or link, and it is labeled by a letter. There
exists a special letter * which represents an "end" character.
If node u points to node v, then we call u the parent of v,
and v is a child node of u.
      </p>
      <p>Every leaf represents a word which is the concatenation
of the letters in the path from the root to the leaf. Note that if the
first k letters are the same in two words, then the first k steps
on their paths are the same as well.</p>
      <p>Tries are suitable to store and retrieve not only words,
but any finite ordered sets. In this setting a link is labeled
by an element of the set, and the trie contains a set if there
exists a path where the links are labeled by the elements of
the set, in increasing order.</p>
      <p>In our data mining context the alphabet is the (ordered)
set of all items I. A candidate k-itemset</p>
      <p>C = {i1 &lt; i2 &lt; . . . &lt; ik}
can be viewed as the word i1i2 . . . ik composed of letters
from I. We do not need the * symbol, because every
inner node represents an important itemset (i.e. a meaningful
word).</p>
      <p>
        Figure 1 presents a trie that stores the
candidates {A,C,D}, {A,E,G}, {A,E,L}, {A,E,M}, {K,M,N}.
Numbers in the nodes serve as identifiers and will be used
in the implementation of the trie. Building a trie is
straightforward, we omit the details, which can be found in [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ].
      </p>
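      <p>Building and searching such a trie can be sketched in a few lines. The following is a minimal pointer-based sketch in C++ (the language of the paper's implementation); it is illustrative only, differs from the vector-based representation described in Section 4, and the names TrieNode, insert and contains are ours:</p>

```cpp
#include <cassert>
#include <map>
#include <vector>

// A minimal pointer-based trie node; edge labels are item codes.
// std::map iterates in key order, so edges are kept ordered by label.
struct TrieNode {
    std::map<int, TrieNode> children;  // label -> child
    unsigned long counter = 0;         // support counter of the itemset
};

// Insert a candidate itemset (items sorted ascending) into the trie.
void insert(TrieNode& root, const std::vector<int>& itemset) {
    TrieNode* node = &root;
    for (int item : itemset)
        node = &node->children[item];  // creates the child if missing
}

// Return true if the itemset is stored as a path from the root.
// Note that every prefix of a stored itemset is also a path.
bool contains(const TrieNode& root, const std::vector<int>& itemset) {
    const TrieNode* node = &root;
    for (int item : itemset) {
        auto it = node->children.find(item);
        if (it == node->children.end()) return false;
        node = &it->second;
    }
    return true;
}
```

      <p>The ordered edge list is what the support count methods of the next section rely on.</p>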
      <p>In the support count method we take the transactions
one-by-one. For a transaction t we take all ordered k-subsets X
of t and search for them in the trie structure. If X is found
(as a candidate), then we increase the support count of this
candidate by one. Here, we do not generate all k-subsets of
t, rather we perform early quits if possible. More precisely,
if we are at a node at depth d by following the j th item link,
then we move forward on those links that have the labels
i ∈ t with index greater than j, but less than |t| − k + d + 1.</p>
      <p>[Figure 1: the trie storing the candidate itemsets listed above; the numbers in the nodes are identifiers.]</p>
      <p>
        In our approach, tries store not only candidates, but
frequent itemsets as well. This has the following advantages:
1. Candidate generation becomes easy and fast. We can
generate candidates from pairs of nodes that have the
same parent (which means that, except for the last
item, the two sets are the same).
2. Association rules are produced much faster, since
retrieving a support of an itemset is quicker (remember
the trie was originally developed to quickly decide if a
word is included in a dictionary).
3. Just one data structure has to be implemented, hence
the code is simpler and easier to maintain.
4. We can immediately generate the so-called
negative border, which plays an important role in many
APRIORI-based algorithms (online association rules
[
        <xref ref-type="bibr" rid="ref26">25</xref>
        ], sampling based algorithms [
        <xref ref-type="bibr" rid="ref27">26</xref>
        ], etc.).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Support Count Methods with Trie</title>
      <p>Support count is done by reading transactions
one-by-one and determining which candidates are contained in the
actual transaction (denoted by t). Finding the candidates contained
in a given transaction primarily determines the overall
running time. There are two simple recursive methods to solve
this task; both start from the root of the trie. The recursive
step is the following (let us denote the number of edges of
the actual node by m).</p>
      <p>1. For each item in the transaction we determine whether
there exists an edge whose label corresponds to the
item. Since the edges are ordered according to their
labels, this means a search in an ordered set.
2. We maintain two indices, one for the items in the
transaction and one for the edges of the node. Each index is
initialized to the first element. Next we check whether the
elements pointed to by the two indices are equal. If they are,
we call our procedure recursively. Otherwise we
increase the index that points to the smaller item. These
steps are repeated until the end of the transaction or the
last edge is reached.</p>
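      <p>The second (merge-like) recursive step can be sketched as follows, assuming a pointer-based node with an ordered edge list; the early-quit bound on the item index discussed above is omitted for brevity, and all names are ours:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Node {
    std::map<int, Node> edges;   // ordered edge list: label -> child
    unsigned long counter = 0;   // support counter at candidate depth
};

// Method 2: walk the (sorted) transaction tail and the ordered edge list
// of the node in parallel with two indices, recursing on every match.
// Counters at depth k (the candidate size) are incremented.
void count(Node& node, const std::vector<int>& t, std::size_t from,
           int depth, int k) {
    if (depth == k) { ++node.counter; return; }
    auto edge = node.edges.begin();
    std::size_t i = from;
    while (edge != node.edges.end() && i < t.size()) {
        if (edge->first == t[i]) {
            count(edge->second, t, i + 1, depth + 1, k);
            ++edge; ++i;                // labels and items are distinct,
        } else if (edge->first < t[i]) { // so advance both after a match
            ++edge;                      // advance the smaller side
        } else {
            ++i;
        }
    }
}
```

      <p>Each recursive step costs at most m + |t| comparisons and involves no subroutine call for the search itself, matching the analysis below.</p>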
      <p>In both methods if item i of the transaction leads to a
new node, then item j is considered in the next step only
if j &gt; i (more precisely item j is considered, if j &lt;
|t| + actual depth − m + 1).</p>
      <p>Let us compare the running time of the methods. Since
both methods are correct, the same branches of the trie will
be visited. Running time difference is determined by the
recursive step. The first method calls the subroutine that
decides whether there exists an edge with a given label |t|
times. If binary search is evoked then it requires log2 m
steps. Also note that a subroutine call needs as many value
assignments as the subroutine has parameters. We can
easily improve the first approach. If the number of edges
is small (i.e. if m log2 |t| &lt; |t| log2 m) we can do the inverse
procedure, i.e. for all labels we check whether there exists a
corresponding item. This way the overall running time is
proportional to min{|t| log2 m, m log2 |t|}.</p>
      <p>The second method needs at least min{m, |t|} and at
most m + |t| steps and there is no subroutine call.</p>
      <p>Theoretically it can happen that the first method is the
better solution (for example if |t|=1, m is large, and the
label of the last edge corresponds to the only item in the
transaction), however in general the second method is faster.
Experiments showed that the second method finished 50%
faster on average.</p>
      <p>Running time of support count can be further reduced if
we modify the trie a little. These small modifications, or tricks,
are described in the next subsections.</p>
      <p>3.2 Storing the Length of Maximal Paths</p>
      <p>Here we show how the time of finding supported
candidates in a transaction can be significantly reduced by
storing a little extra information. The point is that we often
perform superfluous moves in trie search in the sense that
there are no candidates in the direction we are about to
explore. To illustrate this, consider the following example.
Assume that after determining frequent 4-itemsets only
candidate {A, B, C, D, E} was generated, and Figure 2 shows
the resulting trie.</p>
      <p>If we search for 5-itemset candidates supported by the
transaction {A, B, C, D, E, F, G, H, I}, then we must visit
every node of the trie. This appears to be unnecessary since
only one path leads to a node at depth 5, which means that
only one path represents a 5-itemset candidate. Instead of
visiting merely 6 nodes, we visit 32 of them. At each node,
we also have to decide which link to follow, which can
greatly affect running time if a node has many links.</p>
      <p>To avoid this superfluous traveling, at every node we
store the length of the longest directed path that starts from
there. When searching for k-itemset candidates at depth d,
we move downward only if the maximal path length at this
node is at least k − d. Storing these counters needs memory, but
as experiments showed, it can seriously reduce search time
for large itemsets.</p>
    </sec>
    <sec id="sec-4">
      <title>3.3 Frequency Codes</title>
      <p>It often happens that we have to find the node that
represents a given itemset. For example, during candidate
generation, when the subsets of the generated candidate have
to be checked, or if we want to obtain the association rules.
Starting from the root we have to travel, and at depth d we
have to find the edge whose label is the same as the dth
element of the itemset.</p>
      <p>Theoretically, binary search would be the fastest way to
find an item in an ordered list. But if we go down to the
implementation level, we can easily see that if the list is
small, linear search is faster than binary search (because an
iteration step is faster). Hence the fastest solution is to
apply linear search below a threshold and binary search above it. The
threshold does not depend on the characteristics of the data,
but on the ratio of the costs of the elementary operations (value
assignment, increase, division, . . . ).</p>
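      <p>A sketch of such a hybrid search over an ordered label list, with an arbitrary placeholder threshold (the real threshold must be tuned to the machine, as noted above):</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hybrid search in a sorted edge-label list: linear scan below a size
// threshold, binary search above it.  THRESHOLD is machine-dependent;
// 16 is only a placeholder value.
const std::size_t THRESHOLD = 16;

// Returns the index of `item` in the sorted vector, or -1 if absent.
long find_label(const std::vector<int>& labels, int item) {
    if (labels.size() < THRESHOLD) {           // linear search
        for (std::size_t i = 0; i < labels.size(); ++i) {
            if (labels[i] == item) return static_cast<long>(i);
            if (labels[i] > item) return -1;   // sorted: can stop early
        }
        return -1;
    }
    auto it = std::lower_bound(labels.begin(), labels.end(), item);
    if (it != labels.end() && *it == item)
        return it - labels.begin();
    return -1;
}
```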
      <p>In linear search, we read the first item and compare it with the
searched item. If the searched item is smaller, then there is no edge with
this label; if it is greater, we step forward; if they are equal, the
search is finished. If we have bad luck, the most frequent
item has the highest order, and we have to march to the end
of the list whenever this item is searched for.</p>
      <p>On the whole, the search will be faster if the order of
items corresponds to the frequency order of them. We know
exactly the frequency order after the first read of the whole
database, thus everything is at hand to build the trie with
the frequency codes instead of the original codes. The
frequency code of an item i is f c[i], if i is the f c[i]th most
frequent item. Storing frequency codes and their inverses
increases the memory need slightly, in return it increases
the speed of retrieving the occurrence of the itemsets. A
theoretical analysis of the improvement can be read in the
Appendix.</p>
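      <p>Computing the frequency codes after the first database scan amounts to ranking items by descending support. A sketch (0-based ranks, the paper's description is 1-based; names ours):</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Compute frequency codes from the item supports found in the first
// scan: fc[i] = r means item i is the (r+1)-th most frequent.
std::vector<int> frequency_codes(const std::vector<unsigned long>& support) {
    std::vector<int> order(support.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = static_cast<int>(i);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return support[a] > support[b]; });
    std::vector<int> fc(support.size());
    for (std::size_t r = 0; r < order.size(); ++r)
        fc[order[r]] = static_cast<int>(r);  // order is the inverse of fc
    return fc;
}
```

      <p>The vector order above is exactly the stored inverse mentioned in the text: it maps a frequency code back to the original item.</p>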
      <p>Frequency codes also affect the structure of the trie, and
consequently the running time of support count. To
illustrate this, let us suppose that 2 candidates of size 3 are
generated: {A, B, C}, {A, B, D}. Different tries are generated
if the items have different codes. The next figure presents the
tries generated by two different codings.</p>
      <p>If we want to find which candidates are stored in a basket
{A, B, C, D} then 5 nodes are visited in the first case and 7
in the second case. That does not mean that we will find the
candidates faster in the first case, because nodes are not so
"fat" in the second case, which means that they have fewer
edges. Processing a node is faster in the second case, but
more nodes have to be visited.</p>
      <p>In general, if the codes of the items correspond to the
frequency codes, then the resulting trie will be unbalanced, while
in the other case the trie will be rather balanced. Neither
is clearly more advantageous than the other. We chose
to build the unbalanced trie, because it travels through fewer
nodes, which means fewer recursive steps; a recursive step is a slow
operation (a subroutine call with at least five parameters in
our implementation) compared to finding the proper edge at a
node.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">12</xref>
] it was shown that it is advantageous to recode
frequent items according to ascending order of their
frequencies (i.e. the inverse of the frequency codes) because
candidate generation will be faster. The first step of
candidate generation is to find siblings and take the union of the
itemsets represented by them. It is easy to prove that there
are fewer sibling relations in a balanced trie, therefore fewer
unions are generated and the second step of the candidate
generation is evoked fewer times. For example, in our figure
one union would be generated and then deleted in the first
case and none would be generated in the second.
      </p>
      <p>Altogether, frequency codes have advantages and
disadvantages. They accelerate retrieving the support of an
itemset, which can be useful in association rule generation or in
on-line frequent itemset mining, but they slow down candidate
generation. Since candidate generation is many orders of
magnitude faster than support count, the speed decrease is
not noticeable.</p>
      <p>3.4 Determining the support of 1- and 2-itemset
candidates</p>
      <p>
We already mentioned that the support count of
1-element candidates can easily be done by a single vector.
The same easy solution can be applied to 2-itemset
candidates. Here we can use an array [
        <xref ref-type="bibr" rid="ref20">19</xref>
        ]. The next figure
illustrates the solutions.
      </p>
      <p>array[fc[i]][fc[j] − fc[i]]
Note that this solution is faster than the trie-based
solution, since increasing a support takes one step. Its second
advantage is that it needs less memory.</p>
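      <p>A sketch of such a triangular array of 2-itemset counters over frequency codes. Internally it stores entry b − a − 1 of row a for a pair with codes a &lt; b, a 0-based variant of the indexing shown above; the class name is ours:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Triangular array of 2-itemset support counters, indexed by
// frequency codes: row a holds counters for the pairs (a, a+1 .. n-1).
struct PairCounter {
    std::vector<std::vector<unsigned long>> array;
    explicit PairCounter(std::size_t n) : array(n) {
        for (std::size_t a = 0; a < n; ++a)
            array[a].assign(n - a - 1, 0);   // row a: n-a-1 counters
    }
    unsigned long& count(int a, int b) {     // requires a < b
        return array[a][b - a - 1];
    }
    // Count all 2-subsets of a transaction given as sorted codes.
    void add_transaction(const std::vector<int>& t) {
        for (std::size_t i = 0; i < t.size(); ++i)
            for (std::size_t j = i + 1; j < t.size(); ++j)
                ++count(t[i], t[j]);
    }
};
```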
      <p>
        Memory need can be further reduced by applying
onthe-fly candidate generation [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ]. If the number of frequent
items is |L1|, then the number of 2-itemset candidates is
|L1|(|L1| − 1)/2, out of which a lot will never occur. Thus, instead of
the array, we can use a trie. A 2-itemset candidate is inserted
only if it occurs in a basket.
      </p>
      <p>Determining the support with a trie is relatively slow
when we have to step forward from a node that has many
edges. We can accelerate the search if hash-tables are
employed. This way the cost of a step down is calculating one
hash value, thus a recursive step takes exactly |t| steps. We
want to keep the property that a leaf represents exactly one
itemset, hence we have to use perfect hashing. Frequency
codes suit our needs again, because a trie stores only
frequent items. Please note that applying hashing techniques
to tries does not result in the hash-tree proposed in [3].</p>
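      <p>With frequency codes 0..n−1 as keys, the hash-table node degenerates into direct addressing, which is a perfect hash: a step down costs one array access. A minimal sketch (names ours):</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A hash-table node with perfect hashing: since the trie stores only
// frequent items, their frequency codes 0..n-1 can index a table
// directly.  -1 marks a missing edge.
struct HashNode {
    std::vector<long> child;                 // frequency code -> node id
    explicit HashNode(std::size_t n) : child(n, -1) {}
    void add_edge(int fc, long node_id) { child[fc] = node_id; }
    long step(int fc) const { return child[fc]; }  // one step, no search
};
```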
      <p>It is wasteful to change all inner nodes to hash-tables,
since a hash-table needs much more memory than an
ordered list of edges. We propose to alter into hash-tables only
those inner nodes which have more edges than
a reasonable threshold (denoted by leaf max size).
During trie construction, when a new leaf is added, we have
to check whether the number of its parent's edges exceeds
leaf max size. If it does, the node has to be altered into a
hash-table. The inverse of this transformation may be needed
when infrequent itemsets are deleted.</p>
      <p>If the frequent itemsets are stored in the trie, then the
number of edges cannot grow as we go down the trie. In
practice, nodes at higher levels have many edges, and nodes
at lower levels have only a few. The structure of the trie will
be the following: nodes over a (not necessarily
horizontal) line will be hash-tables, while the others will be
original nodes. Consequently, where search was slow, it will be
faster, and where it was fast –because of the small number
of edges– it will remain fast.</p>
    </sec>
    <sec id="sec-5">
      <title>3.6 Brave Candidate Generation</title>
      <p>It is typical that in the last phases of APRIORI there
is only a small number of candidates. However, to determine
their support the whole database is scanned, which is
wasteful. This problem was also mentioned in the original
paper of APRIORI [3], where algorithm APRIORI-HYBRID
was proposed. If a certain heuristic holds, then APRIORI
switches to APRIORI-TID, where for each candidate we
store the transactions that contain it (and support is
immediately obtained). This results in a much faster execution of
the latter phases of APRIORI.</p>
      <p>The hard point of this approach is to tell when to switch
from APRIORI to APRIORI-TID. If the heuristic fails, the
algorithm may need too much memory and become very
slow. Here we choose another approach that we call the
brave candidate generation and the algorithm is denoted by
APRIORI-BRAVE.</p>
      <p>APRIORI-BRAVE keeps track of memory need, and
stores the amount of the maximal memory need. After
determining frequent k-itemsets it generates (k + 1)-itemset
candidates as APRIORI does. However, it does not carry
on with the support count, but checks if the memory need
exceeds the maximal memory need. If not, (k + 2)-itemset
candidates are generated, otherwise support count is evoked
and the maximal memory need counter is updated. We carry on
with memory checks and candidate generation as long as the memory
need does not reach the maximal memory need.</p>
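      <p>The batching decision can be illustrated by the following sketch. It is our speculative reading of the description above, not the paper's code: given estimated memory needs per candidate level, it groups consecutive levels into one database scan while the accumulated need stays within the maximal memory need seen so far:</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// mem[k] is the estimated memory need of the candidates generated at
// level k.  Returns how many candidate levels each database scan covers.
std::vector<int> batch_levels(const std::vector<unsigned long>& mem) {
    std::vector<int> scans;
    unsigned long max_need = 0;          // maximal memory need so far
    std::size_t k = 0;
    while (k < mem.size()) {
        unsigned long need = mem[k];
        int levels = 1;
        // keep generating the next level while we stay within max_need
        while (k + levels < mem.size() &&
               need + mem[k + levels] <= max_need) {
            need += mem[k + levels];
            ++levels;
        }
        max_need = std::max(max_need, need);
        scans.push_back(levels);         // one scan for these levels
        k += levels;
    }
    return scans;
}
```

      <p>As the candidate levels shrink, more and more of them fit under the memory ceiling and are collected into a single scan, which is the effect described in the example below.</p>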
      <p>This procedure will collect together the candidates of the
latter phases and determine their support in one database
read. For example, candidates in database T40I10D100K
with min supp = 0.01 will be processed the following
way: 1, 2, 3, 4-10, 11-14, which means 5 database scans
instead of 14.</p>
      <p>One may say that APRIORI-BRAVE does not consume
more memory than APRIORI and it can be faster,
because there is a possibility that candidates of different
sizes are collected and their support is determined in one
read. However, an acceleration in speed is not guaranteed.
APRIORI-BRAVE may generate (k + 2)-itemset candidates
from frequent k-itemsets, which can lead to more false
candidates. Determining the support of false candidates needs
time. Consequently, we cannot guarantee that
APRIORI-BRAVE is faster than APRIORI; however, test results show
that this heuristic works well in real life.</p>
      <p>3.7 A Dead-end Idea: Support Lower Bound</p>
      <p>
The false candidates problem in APRIORI-BRAVE can
be avoided if only those candidates are generated that are
surely frequent. Fortunately, we can give a lower bound
on the support of a (k + j)-itemset candidate based on the
supports of k-itemsets (j ≥ 0) [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ]. Let X′ = X ∪ Y ∪ Z. The following inequality holds:
supp(X′) ≥ supp(X ∪ Y) + supp(X ∪ Z) − supp(X),
and hence
supp(X′) ≥ max{supp(X′ \ Y) + supp(X′ \ Z) − supp(X′ \ Y \ Z) : Y, Z ∈ X′}.
      </p>
      <p>If we want to give a lower bound on the support of a (k + j)-itemset
based on the supports of k-itemsets, we can use the
generalization of the above inequality (X′ = X ∪ x1 ∪ . . . ∪ xj):
supp(X′) ≥ supp(X ∪ x1) + . . . + supp(X ∪ xj)
− (j − 1) supp(X).</p>
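      <p>The generalized bound is straightforward to compute from stored supports. A sketch (names ours; a negative result simply means the bound carries no information):</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lower bound on supp(X') for X' = X u x1 u ... u xj, following the
// generalized inequality in the text:
//   supp(X') >= supp(X u x1) + ... + supp(X u xj) - (j-1) * supp(X).
long support_lower_bound(long supp_X,
                         const std::vector<long>& supp_X_union_xi) {
    long j = static_cast<long>(supp_X_union_xi.size());
    long sum = 0;
    for (long s : supp_X_union_xi) sum += s;
    return sum - (j - 1) * supp_X;
}
```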
      <p>To avoid false candidate generation we generate only
those candidates that are surely frequent. This way, we
could say that neither memory need nor running time is
worse than APRIORI’s. Unfortunately, this is not true!</p>
      <p>Test results showed that this method is not only slower than
the original APRIORI-BRAVE, but even APRIORI outperforms
it. The reason is simple. Determining the support threshold
is a slow operation (we have to find the supports of many
k-subsets) and it has to be executed many times. It loses more
time with determining support thresholds than we win by
generating some candidates sooner.</p>
      <p>The failure of the support-threshold candidate generation
is a nice example of when a promising idea turns out to be
useless at the implementation level.</p>
    </sec>
    <sec id="sec-6">
      <title>3.8 Storing input</title>
      <p>Many papers on frequent itemset mining focus
on the number of whole database scans. They say
that reading data from disk is much slower than operating
in memory, thus the speed is mainly determined by this
factor. However, in most cases the database is not so big and it
fits into memory. Behind the scenes the operating
system keeps it in memory and the algorithm reads the disk
only once. For example, a database that stores 10^7
transactions, with 6 items per transaction on the
average, needs approximately 120 Mbytes, which is a fraction of
today's average memory capacity. Consequently, if we
explicitly store the simple input data, the algorithm will not
speed up, but will consume more memory (because of the
double storing), which may result in swapping and slowing
down. Again, if we descend to the elementary operations of
an algorithm, we may reach the opposite conclusion.</p>
      <p>Storing the input data is profitable if identical
transactions are gathered up. If a transaction occurs c times, the
support count method is evoked once with counter
increment c, instead of calling the procedure c times with counter
increment 1. In Borgelt's implementation the input is stored in a prefix
tree. This can be dangerous, because the data file can be too large.</p>
      <p>
        We've chosen to store only reduced transactions. A
reduced transaction stores only the frequent items of the
transaction. Reduced transactions carry all the information
needed to discover frequent itemsets of larger sizes, but they
are expected to need less memory (obviously this depends on
min supp). Reduced transactions are stored in a tree for
fast insertion (if reduced transactions are recoded with
frequency codes then we almost get an FP-tree [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ]).
Optionally, when the support of candidates of size k is
determined we can delete those reduced transactions that do not
contain candidates of size k.
      </p>
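      <p>Gathering reduced transactions with their multiplicities can be sketched as follows, mirroring the map-based storage described in Section 4 (names ours):</p>

```cpp
#include <cassert>
#include <map>
#include <vector>

using Transaction = std::vector<int>;

// Keep only the frequent items of each transaction and count how many
// times each reduced transaction occurs, as in the
// map<vector<...>, unsigned long> of the implementation.
std::map<Transaction, unsigned long>
reduce(const std::vector<Transaction>& db,
       const std::vector<bool>& frequent) {
    std::map<Transaction, unsigned long> reduced;
    for (const Transaction& t : db) {
        Transaction r;
        for (int item : t)
            if (frequent[item]) r.push_back(item);  // drop infrequent
        if (!r.empty()) ++reduced[r];
    }
    return reduced;
}
```

      <p>Support count can then be invoked once per distinct reduced transaction, with the stored multiplicity as the counter increment.</p>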
    </sec>
    <sec id="sec-7">
      <title>4. Implementation details</title>
      <p>APRIORI is implemented in an object-oriented manner
in language C++. STL possibilities (vector, set, map)
are heavily used. The algorithm (class Apriori) and the
data structure (class Trie) are separated. We can change
the data structure (for example to a hash-tree) without
modifying the source code of the algorithm.</p>
      <p>The baskets in a file are first stored in a
vector&lt;...&gt;. If we choose to store input –which
is the default– the reduced baskets are stored in a
map&lt;vector&lt;...&gt;,unsigned long&gt;, where the
second parameter is the number of times that reduced
basket occurred. A better solution would be to apply
trie, because map does not make use of the fact that two
baskets can have the same prefixes. Hence insertion of a
basket would be faster, and the memory need would be
smaller, since the same prefixes would be stored just once.
For lack of time, trie-based basket storing was
not implemented, and we do not delete a reduced basket
from the map if it did not contain any candidate during
some scan.</p>
      <p>The class Trie can be programmed in many ways.
We’ve chosen to implement it with vectors and arrays. It
is simple, fast and minimal with respect to memory need.
Each node is described by the same element of the vectors
(or row of the arrays). The root belongs to the 0th element
of each vector. The following figure shows the way the trie
is represented by vectors and arrays.</p>
      <p>The vector edge number stores the number of edges
of each node. itemarray[i] stores the labels of the
edges of node i, and statearray[i] stores the end nodes of those
edges. Vectors parent[i] and maxpath[i] store
the parent and the length of the longest path, respectively.
The occurrences of the itemsets represented by the nodes can be
found in the vector counter.</p>
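      <p>The representation just described can be sketched as a struct of parallel vectors (here the rows are vectors instead of the C arrays of the actual implementation; maintenance of maxpath and counter is omitted):</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Vector/array representation of the trie: node ids index every
// vector; the root is node 0.
struct VectorTrie {
    std::vector<std::vector<int>>  itemarray;   // edge labels per node
    std::vector<std::vector<int>>  statearray;  // end nodes per edge
    std::vector<int>               parent;      // parent of each node
    std::vector<int>               maxpath;     // longest path (unused here)
    std::vector<unsigned long>     counter;     // support counters

    VectorTrie() { add_node(-1); }              // create the root

    int add_node(int par) {
        itemarray.emplace_back();
        statearray.emplace_back();
        parent.push_back(par);
        maxpath.push_back(0);
        counter.push_back(0);
        return static_cast<int>(parent.size()) - 1;
    }
    // Append an edge; labels stay sorted if inserted in order.
    int add_edge(int from, int label) {
        int to = add_node(from);
        itemarray[from].push_back(label);
        statearray[from].push_back(to);
        return to;
    }
    std::size_t edge_number(int node) const { return itemarray[node].size(); }
};
```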
      <p>For vectors we use vector class offered by the STL,
but arrays are stored in a traditional C way. Obviously, they
are not fixed-size arrays (which caused the ugly calloc,
realloc commands in the code). Each row is as long as
the number of edges of the node, and new rows are inserted as the
trie grows (during candidate generation). A more elegant
way would be if the arrays were implemented as a vector
of vectors. The code would be easier to understand and
shorter, because the algorithms of STL could also be used.
However, the algorithm would be slower because
determining a value of the array takes more time. Tests showed that
sacrificing a bit of readability leads to a 10-15% speed-up.</p>
      <p>In our implementation we do not adapt on-line 2-itemset
candidate generation (see Section 3.4) but use a vector and
an array (temp counter array) for determining the
support of 1- and 2-itemset candidates efficiently.</p>
      <p>The vector and array description of a trie makes it
possible to give a fast implementation of the basic functions
(like candidate generation, support count, . . . ). For
example, deleting infrequent nodes and pulling the vectors
together is achieved by a single scan of the vectors. For more
details readers are referred to the html-based
documentation.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Experimental results</title>
      <p>Here we present the experimental results of our
implementation of APRIORI and APRIORI-BRAVE compared
to the two famous APRIORI implementations by
Christian Borgelt (version 4.08)1 and Bart Goethals (release date:
01/06/2003)2. Three databases were used: the well-known
T40I10D100K and T10I4D100K and a coded log of the
clickstream data of a Hungarian on-line news portal (denoted
by kosarak). This database contains 990002 transactions
with an average size of 8.1.</p>
      <p>Tests were run on a PC with 2.8 GHz dual Xeon
processors and 3 GB of RAM. The operating system was
Debian Linux; running times were obtained with the
/usr/bin/time command. The following 3 tables
present the test results of the 3 different implementations
of APRIORI and of APRIORI-BRAVE on the 3 databases.
Each test was carried out 3 times; the tables contain the
averages of the results. The two well-known implementations
are denoted by the last names of their authors.
1http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html#assoc
2http://www.cs.helsinki.fi/u/goethals/software/index.html
</p>
      <p>Strangely, the hashing technique did not always result in
faster execution. The reason for this might be that small
vectors fit in the cache, where linear search is very fast. If we
enlarge the vector by turning it into a hashtable,
it may be moved out to main memory, where reads
are slower. Hashing is thus another example of an accelerating
technique that does not always result in an improvement.</p>
    </sec>
    <sec id="sec-8b">
      <title>6. Further improvement and research possibilities</title>
      <p>Our APRIORI implementation can be further improved
if a trie is used to store the reduced baskets, and a reduced
basket is removed if it does not contain any candidate.</p>
      <p>We mentioned that there are two basic ways of finding
the candidates contained in a given transaction. Further
theoretical and experimental analysis may lead to the
conclusion that a mixture of the two approaches gives the
fastest execution.</p>
      <p>Theoretically, the hashing technique accelerates support
counting. However, our experiments did not support this claim.
Further investigation is needed to clarify the potential
of this technique.</p>
    </sec>
    <sec id="sec-9">
      <title>7. Conclusion</title>
      <p>Determining frequent objects (itemsets, episodes,
sequential patterns) is one of the most important fields of data
mining. It is well known that the way candidates are
defined has a great effect on running time and memory need,
and this is the reason for the large number of algorithms. It
is also clear that the data structure applied influences these
efficiency parameters as well. However, the same algorithm
over the same data structure admits a wide variety of
implementations. In this paper, we showed that different
implementations result in different running times, and the differences can
exceed the differences between algorithms. We presented an
implementation that solved the frequent itemset mining
problem in most cases faster than the other well-known
implementations.</p>
    </sec>
    <sec id="sec-10">
      <title>8. Appendix</title>
      <p>Let us analyze formally how frequency codes accelerate
the search. Suppose that the number of frequent items is m,
that the jth most frequent item has to be searched for n_j times
(n_1 ≥ n_2 ≥ . . . ≥ n_m), and let n = Σ_{i=1}^m n_i. If an item is in
position j, then the cost of finding it is c · j, where c is
a constant. For the sake of simplicity c is omitted. The total
cost of the search based on frequency codes is Σ_{j=1}^m j · n_j.</p>
      <p>How much is the cost if the list is not ordered by
frequencies? We cannot determine this precisely, because we
do not know which item is in the first position, which item
is in the second, etc. We can calculate the expected value of
the total cost if we assume that each order occurs with the
same probability. Then the probability of each permutation
is 1/m!. Thus</p>
      <p>E[total cost] = Σ_π (1/m!) · (cost of π) = (1/m!) · Σ_π Σ_{j=1}^m π(j) · n_{π(j)}.</p>
      <p>Here π runs through the permutations of 1, 2, . . . , m, and
the jth item of π is denoted by π(j). Since each item gets
to each position (m − 1)! times, we obtain that</p>
      <p>E[total cost] = (1/m!) · (m − 1)! · Σ_{j=1}^m n_j · Σ_{k=1}^m k = (1/m) · ((m + 1)m/2) · Σ_{j=1}^m n_j = n(m + 1)/2.</p>
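      <p>This expectation is easy to verify mechanically (our own check, not part of the paper): enumerating every ordering of the items and averaging the total search cost must give exactly n(m + 1)/2, since each item spends an average position of (m + 1)/2 over all orderings.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Average the total search cost over all m! orderings of the item list.
// n_j[j] is the number of times the item currently at logical index j
// is searched for; the cost of finding an item at position p is p.
double expected_cost(const std::vector<int>& n_j) {
    int m = (int)n_j.size();
    std::vector<int> pos(m);
    std::iota(pos.begin(), pos.end(), 0);   // start from the identity ordering
    double sum = 0.0;
    long long perms = 0;
    do {
        long long cost = 0;
        for (int j = 0; j < m; ++j)
            cost += (long long)(pos[j] + 1) * n_j[j];  // position * frequency
        sum += (double)cost;
        ++perms;
    } while (std::next_permutation(pos.begin(), pos.end()));
    return sum / (double)perms;
}
```

For example, with frequencies {10, 4, 3, 2, 1} we have n = 20 and m = 5, so the average over all 120 orderings is n(m + 1)/2 = 60.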
      <p>It is easy to prove that E[total cost] is greater than or
equal to the total cost of the search based on frequency
codes (because of the condition n_1 ≥ n_2 ≥ . . . ≥ n_m).
We want to know more, namely how small the ratio</p>
      <p>(Σ_{j=1}^m j · n_j) / (n(m + 1)/2)</p>
      <p>can be. In the worst case (n_1 = n_2 = . . . = n_m) it is 1; in the
best case (n_1 = n − m + 1, n_2 = n_3 = . . . = n_m = 1) it
converges to 2/(m + 1) as n → ∞, which tends to 0 for large m.</p>
      <p>We cannot say anything more unless the probability
distribution of the frequent items is known. In many applications
there are some very frequent items, while the probabilities of the
rare items differ only slightly. This is why we opted for an
exponential decrease. In our model the probability of
occurrence of the jth most frequent item is p_j = a · e^{bj}, where
a &gt; 0 and b &lt; 0 are two parameters such that a · e^b ≤ 1 holds.
Parameter b can be regarded as the gradient of the
distribution, and parameter a determines the starting point.3</p>
      <p>3Note that Σ p_j does not have to be equal to 1, since more than one
item can occur in a basket.</p>
      <p>We suppose that the ratio of occurrences is the same
as the ratio of the probabilities, hence n_1 : n_2 : . . . :
n_m = p_1 : p_2 : . . . : p_m. From this and the condition
n = Σ_{j=1}^m n_j we infer that n_j = n · p_j / Σ_{k=1}^m p_k. Using the
formula for geometric series, and using the notation x = e^b,
we obtain</p>
      <p>n_j = n · x^j (x − 1) / (x^{m+1} − 1) = n · (x − 1)/(x^{m+1} − 1) · x^j.</p>
      <p>The total cost can now be determined. Let us calculate</p>
      <p>Σ_{j=1}^m j · n_j = n(x − 1)/(x^{m+1} − 1) · Σ_{j=1}^m j · x^j.</p>
      <p>Using the identity</p>
      <p>Σ_{j=1}^m j · x^j = (m x^{m+2} − (m + 1) x^{m+1} + x) / (x − 1)^2,   (1)</p>
      <p>the ratio of total costs can be expressed in a closed formula:</p>
      <p>cost ratio = 2x (m x^{m+1} − (m + 1) x^m + 1) / ((x^{m+1} − 1)(x − 1)(m + 1)),   (2)</p>
      <p>where x = e^b. We can see that the speedup is independent
of a. In Figure 6 three different distributions can be seen
(0.07e^(-0.06x), 0.4e^(-0.1x) and 0.7e^(-0.35x)): the
first is gently sloping, the second has a larger gradient, and
the last distribution is quite steep.</p>
      <p>If we substitute the parameters of the probability
distributions into formula (2) (with m = 10), the results will
be 0.39, 0.73 and 0.8, which meets our expectations: by
adopting frequency codes the search time drops sharply
if the probabilities differ greatly from each other. We have
to remember that frequency codes do not have any effect on
nodes where binary search is applied.</p>
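      <p>As a sanity check of this derivation (ours, not the paper's), the closed formula (2) can be compared numerically against the direct sum Σ j · n_j divided by n(m + 1)/2, using n_j = n(x − 1)x^j / (x^{m+1} − 1); the two must agree for any x = e^b with b &lt; 0.</p>

```cpp
#include <cassert>
#include <cmath>

// Closed-form cost ratio from formula (2), with x = e^b.
double cost_ratio_closed(double x, int m) {
    return 2.0 * x * (m * std::pow(x, m + 1) - (m + 1) * std::pow(x, m) + 1.0)
         / ((std::pow(x, m + 1) - 1.0) * (x - 1.0) * (m + 1));
}

// Direct evaluation: sum j * n_j with n = 1, divided by the expected
// unordered cost n(m + 1)/2.
double cost_ratio_direct(double x, int m) {
    double total = 0.0;
    for (int j = 1; j <= m; ++j)
        total += j * (x - 1.0) * std::pow(x, j) / (std::pow(x, m + 1) - 1.0);
    return total / ((m + 1) / 2.0);
}
```

Evaluating both at, say, x = e^(-0.35) and m = 10 shows they coincide up to floating-point error, confirming that (1) was applied correctly in deriving (2).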
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Swami</surname>
          </string-name>
          .
          <article-title>Mining association rules between sets of items in large databases</article-title>
          .
          <source>In Proc. of the ACM SIGMOD Conference on Management of Data</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Verkamo</surname>
          </string-name>
          .
          <article-title>Fast discovery of association rules</article-title>
          .
          <source>In Advances in Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>307</fpage>
          -
          <lpage>328</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Fast algorithms for mining association rules</article-title>
          .
          <source>The International Conference on Very Large Databases</source>
          , pages
          <fpage>487</fpage>
          -
          <lpage>499</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Mining sequential patterns</article-title>
          . In P. S. Yu and
          <string-name>
            <surname>A. L. P.</surname>
          </string-name>
          Chen, editors,
          <source>Proc. 11th Int. Conf. Data Engineering</source>
          , ICDE, pages
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          . IEEE Press, 6-10
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Tansel</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. E. Arkun.</surname>
          </string-name>
          <article-title>An efficient algorithm to update large itemsets with early pruning</article-title>
          .
          <source>In Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>287</fpage>
          -
          <lpage>291</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Bayardo</surname>
          </string-name>
          , Jr.
          <article-title>Efficiently mining long patterns from databases</article-title>
          .
          <source>In Proceedings of the 1998 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>93</lpage>
          . ACM Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bodon</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Rónyai</surname>
          </string-name>
          .
          <article-title>Trie: an alternative data structure for data mining algorithms</article-title>
          . To appear
          <source>in Computers and Mathematics with Applications</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsur</surname>
          </string-name>
          .
          <article-title>Dynamic itemset counting and implication rules for market basket data</article-title>
          .
          <source>SIGMOD Record (ACM Special Interest Group on Management of Data)</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <fpage>255</fpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. W.-L.</given-names>
            <surname>Cheung</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. Y.</given-names>
            <surname>Wong</surname>
          </string-name>
          .
          <article-title>Maintenance of discovered association rules in large databases: An incremental updating technique</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>106</fpage>
          -
          <lpage>114</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>D. W.-L. Cheung</surname>
            ,
            <given-names>S. D.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            , and
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Kao</surname>
          </string-name>
          .
          <article-title>A general incremental technique for maintaining discovered association rules</article-title>
          .
          <source>In Database Systems for Advanced Applications</source>
          , pages
          <fpage>185</fpage>
          -
          <lpage>194</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          .
          <article-title>Discovery of multiple-level rules from large databases</article-title>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          .
          <article-title>Survey on frequent pattern mining</article-title>
          .
          <source>Technical report</source>
          , Helsinki Institute for Information Technology,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          . Pei, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation</article-title>
          . In W. Chen,
          <string-name>
            <given-names>J.</given-names>
            <surname>Naughton</surname>
          </string-name>
          , and P. A. Bernstein, editors,
          <source>2000 ACM SIGMOD Intl. Conference on Management of Data</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . ACM Press,
          <year>05 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huhtala</surname>
          </string-name>
          , J. Kärkkäinen, P. Porkka, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          . TANE:
          <article-title>An efficient algorithm for discovering functional and approximate dependencies</article-title>
          .
          <source>The Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huhtala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kärkkäinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Porkka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>Efficient discovery of functional and approximate dependencies using partitions</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>392</fpage>
          -
          <lpage>401</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          .
          <article-title>Discovery of multiple-level association rules from large databases</article-title>
          .
          <source>In Proc. of the 21st International Conference on Very Large Databases (VLDB)</source>
          , Zurich, Switzerland,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Knuth</surname>
          </string-name>
          .
          <source>The Art of Computer Programming</source>
          Vol.
          <volume>3</volume>
          .
          <string-name>
            <surname>Addison-Wesley</surname>
          </string-name>
          ,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A. I. Verkamo.</surname>
          </string-name>
          <article-title>Discovering frequent episodes in sequences</article-title>
          .
          <source>In Proceedings of the First International Conference on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>210</fpage>
          -
          <lpage>215</lpage>
          . AAAI Press,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ozden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramaswamy</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Silberschatz</surname>
          </string-name>
          .
          <article-title>Cyclic association rules</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>412</fpage>
          -
          <lpage>421</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>An effective hash based algorithm for mining association rules</article-title>
          . In M. J. Carey and
          <string-name>
            <surname>D. A</surname>
          </string-name>
          . Schneider, editors,
          <source>Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          , San Jose, California,
          22-25
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarasere</surname>
          </string-name>
          , E. Omiecinsky, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Navathe</surname>
          </string-name>
          .
          <article-title>An efficient algorithm for mining association rules in large databases</article-title>
          .
          <source>In Proc. 21st International Conference on Very Large Databases (VLDB)</source>
          , Zurich, Switzerland,
          <source>Also Gatech Technical Report No. GIT-CC-95-04</source>
          .,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Mining generalized association rules</article-title>
          .
          <source>In Proc. of the 21st International Conference on Very Large Databases (VLDB)</source>
          , Zurich, Switzerland,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Mining sequential patterns: Generalizations and performance improvements</article-title>
          .
          <source>Technical report</source>
          , IBM Almaden Research Center, San Jose, California,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bodagala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alsabti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranka</surname>
          </string-name>
          .
          <article-title>An efficient algorithm for the incremental updation of association rules in large databases</article-title>
          .
          <source>page 263.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bodagala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alsabti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranka</surname>
          </string-name>
          .
          <article-title>An efficient algorithm for the incremental updation of association rules in large databases</article-title>
          .
          <source>In Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>263</fpage>
          -
          <lpage>266</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>Sampling large databases for association rules</article-title>
          .
          <source>In The VLDB Journal</source>
          , pages
          <fpage>134</fpage>
          -
          <lpage>145</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>