<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ARMOR: Association Rule Mining based on ORacle</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vikram Pudi</string-name>
          <email>vikram@iiit.net</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jayant R. Haritsa</string-name>
          <email>haritsa@dsl.serc.iisc.ernet.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Systems Lab, SERC, Indian Institute of Science</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Intl. Inst. of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we first focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our strategy is to compare their performance against an “Oracle algorithm” that knows in advance the identities of all frequent itemsets in the database and only needs to gather their actual supports to complete the mining process. Our experimental results show that current mining algorithms do not perform uniformly well with respect to the Oracle for all database characteristics and support thresholds. In many cases there is a substantial gap between the Oracle's performance and that of the current mining algorithms. Second, we present a new mining algorithm, called ARMOR, that is constructed by making minimal changes to the Oracle algorithm. ARMOR consistently performs within a factor of two of the Oracle on both real and synthetic datasets over practical ranges of support specifications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>We focus our attention on the question of how much
space remains for performance improvement over current
association rule mining algorithms. Our approach is to
compare their performance against an “Oracle algorithm” that
knows in advance the identities of all frequent itemsets in
the database and only needs to gather the actual supports
of these itemsets to complete the mining process. Clearly,
any practical algorithm will have to do at least this much
work in order to generate mining rules. This “Oracle
approach” permits us to clearly demarcate the maximal space
available for performance improvement over the currently
available algorithms. Further, it enables us to construct new
mining algorithms from a completely different perspective,
namely, as minimally-altered derivatives of the Oracle.</p>
      <p>First, we show that while the notion of the Oracle is
conceptually simple, its construction is not equally
straightforward. In particular, it is critically dependent on the choice
of data structures used during the counting process. We
present a carefully engineered implementation of Oracle
that makes the best choices for these design parameters at
each stage of the counting process. Our experimental results
show that there is a considerable gap in the performance
between the Oracle and existing mining algorithms.</p>
      <p>
        Second, we present a new mining algorithm, called
ARMOR (Association Rule Mining based on ORacle), whose
structure is derived by making minimal changes to the
Oracle, and is guaranteed to complete in two passes over the
database. Although ARMOR is derived from the Oracle, it
may be seen to share the positive features of a variety of
previous algorithms such as PARTITION [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], CARMA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
AS-CPA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], VIPER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and DELTA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our empirical
study shows that ARMOR consistently performs within a
factor of two of the Oracle, over both real
(BMS-WebView1 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] from Blue Martini Software) and synthetic databases
(from the IBM Almaden generator [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) over practical ranges
of support specifications.
      </p>
      <p>The environment we consider is one where the pattern
lengths in the database are small enough that the size of
mining results is comparable to the available main memory.
This holds when the mined data conforms to the sparse
nature of market basket data for which association rule mining
was originally intended. Mining all frequent itemsets is
perhaps inappropriate for dense datasets.</p>
      <p>For ease of exposition, we will use the notation shown in
Table 1 in the remainder of this paper.</p>
      <p>Organization The remainder of this paper is organized as
follows: The design of the Oracle algorithm is described in
Section 2 and is used to evaluate the performance of current
algorithms in Section 3. Our new ARMOR algorithm is
presented in Section 4 and its main memory requirements
are discussed in Section 5. The performance of ARMOR is
evaluated in Section 6. Finally, in Section 7, we summarize
the conclusions of our study.</p>
      <p>Table 1 lists: the database of customer purchase
transactions; the user-specified minimum rule support; the set of
frequent itemsets in the database; the set of itemsets in the
negative border of the frequent itemsets; the set of disjoint
partitions of the database; the transactions in the partitions
scanned so far during algorithm execution, excluding the
current partition; the transactions in the partitions scanned so
far, including the current partition; and the DAG structure
used to store candidates.</p>
    </sec>
    <sec id="sec-2">
      <title>The Oracle Algorithm</title>
      <p>In this section we present the Oracle algorithm which,
as mentioned in the Introduction, “magically” knows in
advance the identities of all frequent itemsets in the database
and only needs to gather the actual supports of these
itemsets. Clearly, any practical algorithm will have to do at least
this much work in order to generate mining rules. Oracle
takes as input the database, in item-list format (which
is organized as a set of rows with each row storing an
ordered list of item-identifiers (IID), representing the items
purchased in the transaction), the set of frequent itemsets
and its corresponding negative border, and outputs
the supports of these itemsets by making one scan over the
database. We first describe the mechanics of the Oracle
algorithm below and then move on to discuss the rationale
behind its design choices in Section 2.2.</p>
      <sec id="sec-2-1">
        <title>The Mechanics of Oracle</title>
        <p>For ease of exposition, we first present the manner in
which Oracle computes the supports of 1-itemsets and
2-itemsets and then move on to longer itemsets. Note,
however, that the algorithm actually performs all these
computations concurrently in one scan over the database.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Counting Singletons and Pairs</title>
        <p>
          Data-Structure Description The counters of singletons
(1-itemsets) are maintained in a 1-dimensional lookup
array, and those of pairs (2-itemsets) in a lower
triangular 2-dimensional lookup array. (Similar arrays are also
used in Apriori [
          <xref ref-type="bibr" rid="ref11 ref2">2, 11</xref>
          ] for its first two passes.) Each entry
in the singleton array contains two fields: (1) the counter
for the corresponding item, and (2) the number of frequent
items prior to it, if the item is frequent; null, otherwise.
Algorithm Description The ArrayCount function shown
in Figure 1 takes as inputs a transaction along with these
two arrays, and updates their counters over the transaction.
In the ArrayCount function, the individual items in the
transaction are enumerated (lines 2–5) and, for each item, its
corresponding count is incremented (line 3). During
this process, the frequent items in the transaction are stored in
a separate itemset (line 5). We then enumerate all pairs of
items contained in this itemset (lines 6–10) and increment
the counters of the corresponding 2-itemsets (lines 8–10).
Data-Structure Description Itemsets of length
greater than 2 and their related information (counters, etc.)
are stored in a DAG structure, which is pictorially shown
in Figure 2 for a database with items {A, B, C, D}.
Although singletons and pairs are stored in lookup arrays, as
mentioned before, for expository ease we assume that they
too are stored in the DAG in the remainder of this discussion.
        </p>
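As a concrete illustration of this counting scheme, the following is a minimal Python sketch of the two lookup arrays and the ArrayCount idea. All names are ours, and the layout is only one plausible realization of the design described above:

```python
# Sketch of ArrayCount: singletons counted in a 1-d array, pairs in a
# lower-triangular 2-d array. Identifiers are illustrative, not the
# paper's actual names.

def make_arrays(num_items):
    counts1 = [0] * num_items
    # lower-triangular array: row i holds entries for pairs (j, i), j < i
    counts2 = [[0] * i for i in range(num_items)]
    return counts1, counts2

def array_count(transaction, counts1, counts2, frequent_items):
    """Update singleton and pair counters over one transaction.

    `transaction` is an ordered list of item-identifiers (IIDs);
    pairs are enumerated only over the frequent items.
    """
    kept = []
    for item in transaction:
        counts1[item] += 1              # increment singleton counter
        if item in frequent_items:
            kept.append(item)           # remember frequent items
    # enumerate all pairs of frequent items in the transaction
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            lo, hi = kept[i], kept[j]   # lo < hi since IIDs are ordered
            counts2[hi][lo] += 1        # lower-triangular indexing
    return counts1, counts2
```

The lower-triangular layout stores each unordered pair exactly once, which is why the larger IID selects the row and the smaller one the column.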
        <p>Each itemset is stored in a separate node of the DAG and is
linked to the first two (in a lexicographic ordering) of its
subsets. We use the terms “mother” and “father” of an
itemset to refer to the (lexicographically) smaller subset and the
(lexicographically) larger subset, respectively. E.g., {A, B}
and {A, C} are the mother and father, respectively, of {A, B,
C}. For each itemset, we also store links to
those of its supersets for which it is a mother. We call this
list of links its childset.</p>
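A minimal Python sketch of this node layout follows; the class and field names are ours, chosen to mirror the mother/father/childset terminology above:

```python
# Sketch of a DAG node: each itemset links to its "mother" and "father"
# (its two lexicographically smallest subsets) and keeps a childset of
# the supersets for which it is the mother. Names are illustrative.

class Node:
    def __init__(self, itemset, mother=None, father=None):
        self.itemset = tuple(sorted(itemset))
        self.mother = mother      # lexicographically smaller subset
        self.father = father      # lexicographically larger subset
        self.childset = []        # supersets for which this node is mother
        self.count = 0
        self.tidlist = []
        if mother is not None:
            mother.childset.append(self)

def first_two_subsets(itemset):
    """Return the two lexicographically smallest (k-1)-subsets."""
    items = sorted(itemset)
    mother = tuple(items[:-1])                # drop the largest item
    father = tuple(items[:-2] + items[-1:])   # drop the second-largest
    return mother, father
```

For example, `first_two_subsets(('A', 'B', 'C'))` yields `('A', 'B')` and `('A', 'C')`, matching the mother/father example in the text.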
        <p>
          Since each itemset is stored in a separate node in the
DAG, we use the terms “itemset” and “node”
interchangeably in the remainder of this discussion.
Algorithm Description We use a partitioning [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] scheme
wherein the database is logically divided into disjoint
horizontal partitions. In this scheme, the itemsets
being counted are enumerated only at the end of each
partition and not after every tuple. Each partition is as large as
can fit in available main memory. For ease of exposition, we
assume that the partitions are equi-sized; however, the
technique is easily extended to arbitrary partition sizes.
        </p>
        <p>The pseudo-code of Oracle is shown in Figure 3 and
operates as follows: The ReadNextPartition function (line 3)
reads tuples from the next partition and simultaneously
creates tid-lists (within that partition) of the singleton itemsets.
Note that this conversion of the database to the tid-list
(TL) format is an on-the-fly operation and does not change
the complexity of Oracle by more than a (small) constant
factor. Next, the Update function (line 5) is applied to
each singleton. This function takes a node of the DAG as
input and updates the counts of all of the node's descendants
to reflect their counts over the current partition. The count of
any itemset within a partition is equal to the length of its
corresponding tidlist (within that partition). The tidlist of
an itemset can be obtained as the intersection of the tidlists
of its mother and father, and this process is started off using
the tidlists of the frequent 1-itemsets. The exact details of
tidlist computation are discussed later.</p>
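The on-the-fly conversion performed by ReadNextPartition can be sketched in Python as follows; the function and variable names are ours, and the item-list input format follows the description in Section 2:

```python
# Sketch of ReadNextPartition's conversion: a partition stored in
# item-list format is turned into per-item tid-lists local to that
# partition. Names are illustrative.

def read_partition(partition, frequent_singletons):
    """partition: list of (tid, ordered list of item-identifiers).

    Returns a dict mapping each frequent singleton to the ordered
    list of TIDs (within this partition) that contain it.
    """
    tidlists = {item: [] for item in frequent_singletons}
    for tid, items in partition:
        for item in items:
            if item in tidlists:
                tidlists[item].append(tid)
    return tidlists
```

Because transactions are scanned in TID order, each tid-list comes out ordered for free, which the later intersection step relies on.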
        <p>We now describe the manner in which the itemsets in
the DAG are enumerated after reading in a new partition. The
childset links induce a spanning tree of the DAG (e.g.,
consider only the solid edges in Figure 2). We perform a
depth-first search on this spanning tree to enumerate all its
itemsets. When a node in the tree is visited, we compute the
tidlists of all its children. This ensures that when an itemset
is visited, the tidlists of its mother and father have already
been computed.</p>
        <p>A tid-list of an itemset is an ordered list of the TIDs of
the transactions that contain it.</p>
        <p>Oracle ( , )
Input: Database , Itemsets to be Counted
Output: Itemsets in with Supports
1. = Number of Partitions
2. for = 1 to
3.     ReadNextPartition( , );
4.     for each singleton in
5.         Update( );</p>
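The five-line loop of Figure 3 can be sketched in Python as follows. This is only an outline under our own names; the Update callback stands in for the Figure 4 function described next:

```python
# Minimal sketch of the Oracle driver: scan the database one partition
# at a time, build per-partition tid-lists for the singletons, then
# propagate counts down the DAG from each singleton. Names are ours.

def oracle(partitions, singletons, update):
    """partitions: iterable of partitions in item-list format;
    singletons: the singleton nodes of the DAG;
    update: a function that propagates counts from a node."""
    for partition in partitions:
        tidlists = {}
        for tid, items in partition:        # ReadNextPartition (line 3)
            for item in items:
                tidlists.setdefault(item, []).append(tid)
        for node in singletons:             # one Update call per singleton
            node.tidlist = tidlists.get(node.item, [])
            update(node)                    # line 5
```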
        <p>The above processing is captured in the function Update,
whose pseudo-code is shown in Figure 4. Here, the tidlist
of a given node is first converted to the tid-vector (TV)
format (line 1). Then, the tidlists of all the node's children are
computed (lines 2–4), after which the same children are
visited in a depth-first search (lines 5–6).</p>
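A Python sketch of this processing is given below; names are ours, and the inline intersection is a straightforward rendering of the mother-father intersection described earlier:

```python
# Sketch of Update: convert the node's tid-list to a tid-vector,
# intersect it with each child's father tid-list to obtain the child's
# tid-list, then recurse depth-first. Names are illustrative.

def update(node, partition_size, partition_start):
    # line 1: tid-list -> tid-vector (one bit per partition transaction)
    tv = [0] * partition_size
    for tid in node.tidlist:
        tv[tid - partition_start] = 1
    # lines 2-4: compute tid-lists of all children (node is their mother)
    for child in node.childset:
        child.tidlist = [t for t in child.father.tidlist
                         if tv[t - partition_start] == 1]
        child.count += len(child.tidlist)   # count within this partition
    # lines 5-6: visit the children depth-first
    for child in node.childset:
        update(child, partition_size, partition_start)
```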
        <p>The mechanics of tidlist computation, as promised
earlier, are given in Figure 5. The Intersect function shown
here takes as input a tid-vector and a tid-list. Each TID
in the tid-list is added to the result if the corresponding bit
of the tid-vector is 1 (lines 2–5), where the bit position,
defined in line 3, is that of the transaction relative to the
current partition.</p>
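A one-function Python sketch of Intersect, under our own names:

```python
# Sketch of Intersect: each TID in the tid-list is kept if the
# corresponding bit of the tid-vector is set; the offset is the TID's
# position relative to the current partition. Names are illustrative.

def intersect(tidvector, tidlist, partition_start):
    result = []
    for tid in tidlist:
        offset = tid - partition_start   # position within this partition
        if tidvector[offset] == 1:
            result.append(tid)
    return result
```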
        <p>A tid-vector of an itemset is a bit-vector of 1’s and 0’s
representing the presence or absence, respectively, of the
itemset in the set of customer transactions.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Rationale for the Oracle Design</title>
        <p>
          We show that Oracle is optimal in two respects: (1) it
enumerates only those itemsets that need to be
enumerated, and (2) the enumeration is performed in the most
efficient way possible. These results are based on the
following two theorems. Due to lack of space, we have deferred
the proofs of the theorems to [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Theorem 2.1 If the size of each partition is large enough
that every itemset of length greater than 2 being counted is
present at least once in it, then the only itemsets
enumerated in the Oracle algorithm are those whose counts
need to be incremented in that partition.</p>
        <p>Theorem 2.2 The cost of enumerating each itemset in
Oracle is constant, with a tight constant factor.</p>
        <p>
          While Oracle is optimal in these respects, we note that
there may remain some scope for improvement in the details
of tidlist computation. That is, the Intersect function
(Figure 5), which computes the intersection of a tid-vector and
a tid-list, requires one operation per TID of the tid-list. The
tid-vector itself was originally constructed from a tid-list,
although this cost is amortized over many calls to the
Intersect function. We plan to investigate in our future work
whether the intersection of two sets can, in general, be
computed more efficiently – for example, using diffsets, a novel
and interesting approach suggested in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The diffset of an itemset is the set-difference
of its tid-list from that of its mother. Diffsets can be easily
incorporated in Oracle – only the Update function in
Figure 4 of Section 2 needs to be changed, to compute diffsets
instead of tidlists by following the techniques suggested
in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Advantages of Partitioning Schemes Oracle, as
discussed above, uses a partitioning scheme. An alternative
commonly used in current association rule mining
algorithms, especially in hashtree [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] based schemes, is a
tuple-by-tuple approach. A problem with the tuple-by-tuple
approach, however, is that there is considerable wasted
enumeration of itemsets. The core operation in these algorithms
is to determine all candidates that are subsets of the current
transaction. Given that a frequent itemset is present in
the current transaction, we need to determine all candidates
that are its immediate supersets and are also present in
the current transaction. In order to achieve this, it is often
necessary to enumerate and check for the presence of many
more candidates than those that are actually present in the
current transaction.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Performance Study</title>
      <p>
        In the previous section, we have described the Oracle
algorithm. In order to assess the performance of current
mining algorithms with respect to the Oracle algorithm, we have
chosen VIPER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and FP-growth [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], among the latest in
the suite of online mining algorithms. For completeness
and as a reference point, we have also included the classical
Apriori in our evaluation suite.
      </p>
      <p>Our experiments cover a range of database and
mining workloads, and include the typical and extreme cases
considered in previous studies – the only difference is that
we also consider database sizes that are significantly larger
than the available main memory. The performance metric
in all the experiments is the total execution time taken by
the mining operation.</p>
      <p>
        The databases used in our experiments were
synthetically generated using the technique described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
attempt to mimic the customer purchase behavior seen in
retailing environments. The parameters used in the synthetic
generator and their default values are described in Table 2.
In particular, we consider databases with parameters T10.I4,
T20.I12 and T40.I8 with 10 million tuples in each of them.
      </p>
      <p>The parameters of Table 2 are: the number of items; the
mean transaction length; the number of potentially frequent
itemsets; the mean length of the potentially frequent itemsets;
and the number of transactions in the database.</p>
      <p>We set the rule support threshold values to as low as
was feasible with the available main memory. At these low
support values the number of frequent itemsets exceeded
twenty-five thousand! Beyond this, we felt that the number
of rules generated would be enormous and the purpose of
mining – to find interesting patterns – would not be served.
In particular, we set the rule support threshold values for the
T10.I4, T20.I12 and T40.I8 databases to the ranges (0.1%–
2%), (0.4%–2%) and (1.15%–5%), respectively.</p>
      <p>Our experiments were conducted on a 700-MHz Pentium
III workstation running Red Hat Linux 6.2, configured with
a 512 MB main memory and a local 18 GB SCSI 10000
rpm disk. For the T10.I4, T20.I12 and T40.I8 databases,
the associated database sizes were approximately 500MB,
900MB and 1.7 GB, respectively. All the algorithms in our
evaluation suite are written in C++. We implemented a
basic version of the FP-growth algorithm wherein we assume
that the entire FP-tree data structure fits in main memory.
Finally, the partition size in Oracle was fixed at 20K
tuples.</p>
      <sec id="sec-3-1">
        <title>Experimental Results for Current Mining Algorithms</title>
        <p>We now report on our experimental results. We
conducted two experiments to evaluate the performance of
current mining algorithms with respect to the Oracle. Our
first experiment was run on large (10M tuples) databases,
while our second experiment was run on small (100K
tuples) databases.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experiment 1: Performance of Current Algorithms</title>
        <p>In our first experiment, we evaluated the performance of
Apriori, VIPER and Oracle algorithms for the T10.I4,
T20.I12 and T40.I8 databases, each containing 10M
transactions; these results are shown in Figures 6a–c. The
x-axis in these graphs represents the support threshold
values while the y-axis represents the response times of the
algorithms being evaluated.</p>
        <p>In these graphs, we see that the response times of all
algorithms increase exponentially as the support threshold is
reduced. This is only to be expected since the number of
itemsets in the output increases exponentially with
decrease in the support threshold.</p>
        <p>We also see that there is a considerable gap in the
performance of both Apriori and VIPER with respect to Oracle.
For example, in Figure 6a, at a support threshold of 0.1%,
the response time of VIPER is more than 6 times that of
Oracle whereas the response time of Apriori is more than 26
times!</p>
        <p>
          In this experiment, we could not evaluate the
performance of FP-growth because it did not complete in any of
our runs on large databases due to its heavy and database
size dependent utilization of main memory. The reason for
this is that FP-growth stores the database itself in a
condensed representation in a data structure called FP-tree. In
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the authors briefly discuss the issue of constructing
disk-resident FP-trees. We, however, did not take this into
account in our implementation.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Experiment 2: Small Databases</title>
        <p>Since, as mentioned above, it was not possible for us to
evaluate the performance of FP-growth on large databases due
to its heavy utilization of main memory, we evaluated the
performance of FP-growth (the original implementation by
Han et al. was not available) and other current algorithms on
small databases consisting of 100K transactions. The
results of this experiment are shown in Figures 7a–c, which
correspond to the T10.I4, T20.I12 and T40.I8 databases,
respectively.</p>
        <p>In these graphs, we see there continues to be a
considerable gap in the performance of current mining
algorithms with respect to Oracle. For example, for the T40.I8
database, the response time of FP-growth is more than 8
times that of Oracle for the entire support threshold range.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The ARMOR Algorithm</title>
      <p>ARMOR takes as input the database, the set of items and
the minimum support, and outputs the frequent itemsets with
their supports; its pseudo-code appears in Figure 8.</p>
      <p>
        In the previous section, our experimental results have
shown that there is a considerable gap in the performance
between the Oracle and existing mining algorithms. We
now move on to describe our new mining algorithm,
ARMOR (Association Rule Mining based on ORacle). In this
section, we overview the main features and the flow of
execution of ARMOR – the details of candidate generation are
deferred to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] due to lack of space.
      </p>
      <p>The guiding principle in our design of the ARMOR
algorithm is that we consciously make an attempt to determine
the minimal amount of change to Oracle required to result
in an online algorithm. This is in marked contrast to the
earlier approaches which designed new algorithms by trying to
address the limitations of previous online algorithms. That
is, we approach the association rule mining problem from a
completely different perspective.
</p>
      <p>
In ARMOR, as in Oracle, the database is conceptually
partitioned into disjoint blocks. At most
two passes are made over the database. In the first pass we
form a set of candidate itemsets that is guaranteed to
be a superset of the set of frequent itemsets. During the
first pass, the counts of the candidates are determined over
each partition in exactly the same way as in Oracle, by
maintaining the candidates in a DAG structure. The 1-itemsets
and 2-itemsets are stored in lookup arrays, as in Oracle. But,
unlike in Oracle, candidates are inserted into and removed from
the DAG at the end of each partition. Generation and removal
of candidates is done simultaneously while computing counts.
The details of candidate generation and removal during the
first pass are described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] due to lack of space. For ease
of exposition, we assume in the remainder of this section
that all candidates (including 1-itemsets and 2-itemsets) are
stored in the DAG.
      </p>
      <p>
        Along with each candidate, we also store the
following three integers, as in the CARMA algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: (1)
the number of occurrences of the candidate since it was last
inserted in the DAG; (2) the index of the partition at which
the candidate was inserted; and (3) an upper bound on the
number of occurrences of the candidate before it was inserted.
The last two quantities are computed when the candidate is
inserted into the DAG, in a manner identical to CARMA.
      </p>
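This per-candidate bookkeeping, and the support bounds it yields over the database scanned so far, can be sketched in Python as follows. All names are ours, chosen to mirror the three integers above; the exact field names of CARMA are not reproduced here:

```python
# Sketch of CARMA-style candidate bookkeeping: the count since
# insertion, the insertion partition index, and an upper bound on
# earlier occurrences bound the candidate's support over the database
# scanned so far. Names are illustrative.

class Candidate:
    def __init__(self, inserted_at, max_missed):
        self.count = 0                  # occurrences since insertion
        self.inserted_at = inserted_at  # partition index at insertion
        self.max_missed = max_missed    # bound on earlier occurrences

def min_support(c, tuples_scanned):
    return c.count / tuples_scanned

def max_support(c, tuples_scanned):
    return (c.count + c.max_missed) / tuples_scanned

def is_locally_frequent(c, tuples_scanned, minsup):
    # the lower bound already meets the support threshold
    return min_support(c, tuples_scanned) >= minsup
```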
      <p>
        While the CARMA algorithm works on a tuple-by-tuple
basis, we have adapted the semantics of these fields to suit
the partitioning approach. Given the portion of the database
scanned so far (refer Table 1), the support of any candidate
over that portion will lie in a range [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] whose endpoints are denoted by
minSupport and maxSupport, respectively. We
define an itemset to be -frequent if its minSupport is at
least the minimum support threshold. Unlike in the CARMA
algorithm, where only frequent itemsets are stored at any stage,
the DAG structure in ARMOR contains other candidates,
including the negative border of the -frequent itemsets, to
ensure efficient candidate generation. The details are given in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        At the end of the first pass, the candidate set is pruned
to include only -frequent itemsets and their negative
border. The counts of these itemsets over the entire database
are determined during the second pass. The counting
process is again identical to that of Oracle. No new candidates
are generated during the second pass. However, candidates
may be removed. The details of candidate removal in the
second pass are deferred to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
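The overall two-pass flow just described can be sketched in Python as follows. This is only an outline under our own names; the candidate-maintenance details deferred to [8] are represented by opaque callbacks:

```python
# Minimal sketch of ARMOR's two-pass flow: pass 1 grows and shrinks the
# candidate set at partition boundaries; pass 2 recounts the pruned set
# and never generates new candidates. Names are illustrative and the
# candidate-maintenance internals are elided.

def armor(partitions, update1, update2, prune):
    candidates = set()                    # would start with singletons
    for partition in partitions:          # first pass
        update1(candidates, partition)    # count + insert/remove candidates
    candidates = prune(candidates)        # keep locally frequent itemsets
                                          # and their negative border
    for partition in partitions:          # second pass
        if not candidates:                # early exit when nothing is left
            break
        update2(candidates, partition)    # complete the counts
    return candidates
```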
      <p>The pseudo-code of ARMOR is shown in Figure 8 and
is explained below.</p>
      <sec id="sec-4-1">
        <title>First Pass</title>
        <p>At the beginning of the first pass, the set of candidate
itemsets is initialized to the set of singleton itemsets (line
2). The ReadNextPartition function (line 4) reads tuples
from the next partition and simultaneously creates tid-lists
of the singleton itemsets.</p>
        <p>
          After reading in the entire partition, the Update1
function (details in [
          <xref ref-type="bibr" rid="ref8">8</xref>
]) is applied to each singleton (lines
5–7). It increments the counts of existing candidates by
their corresponding counts in the current partition. It is also
responsible for generation and removal of candidates.
        </p>
        <p>At the end of the first pass, the candidate set contains a
superset of the set of frequent itemsets. For a candidate
inserted at a given partition, its count over the partitions
scanned since its insertion will be available.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Second Pass</title>
        <p>At the beginning of the second pass, candidates that
are neither -frequent nor part of the current negative
border are removed from the candidate set (line 8). For
candidates that were inserted at the first partition, their counts
over the entire database will be available. These itemsets with
their counts are output (line 9). The OutputFinished function
also performs the following task: if it outputs an itemset that
has no supersets left in the DAG, the itemset is removed from
the DAG.</p>
        <p>
          During the second pass, the ReadNextPartition
function (line 13) reads tuples from the next partition and
creates tid-lists of singleton itemsets in . After reading in the
entire partition, the Update2 function (details in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) is
applied to each singleton (lines 14–15). Finally, before
reading in the next partition we check to see if there are any
more candidates. If not, the mining process terminates.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Memory Utilization in ARMOR</title>
      <p>In the design and implementation of ARMOR, we have
opted for speed in most decisions that involve a space-speed
tradeoff. Therefore, the main memory utilization of
ARMOR is certainly higher than that of algorithms such as
Apriori. However, in the following discussion, we show that
the memory usage of ARMOR is well within the reach of
current machine configurations. This is also experimentally
confirmed in the next section.</p>
      <p>The main memory consumption of ARMOR comes from
the following sources: (1) The 1-d and 2-d arrays for
storing counters of singletons and pairs, respectively; (2) The
DAG structure for storing counters (and tidlists) of longer
itemsets, and (3) The current partition.</p>
      <p>
        The total number of entries in the 1-d and 2-d arrays and
in the DAG structure corresponds to the number of
candidates in ARMOR, which as we have discussed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], is
only marginally more than the number of frequent itemsets
and their negative border. For the moment, if
we disregard the space occupied by tidlists of itemsets, then
the amortized amount of space taken by each candidate is a
small constant (about 10 integers for the DAG and 1 integer
for the arrays). E.g., if there are 1 million candidates in the
DAG and 10 million in the arrays, the space required is about
80MB. Since the environment we consider is one where
the pattern lengths are small, the number of candidates will
typically be well within the available main memory. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
discusses alternative approaches when this assumption does
not hold.
      </p>
      <p>Regarding the space occupied by tidlists of itemsets, note
that ARMOR only needs to store tidlists of -frequent
itemsets. The number of -frequent itemsets is of the same
order as the number of frequent itemsets. The total space
occupied by tidlists while processing a partition is then
bounded by the partition size multiplied by the number of
frequent itemsets. E.g., for a partition of 20K tuples and a few
thousand frequent itemsets, the space occupied by tidlists is
bounded by about 400MB. We assume the number of frequent
itemsets to be in the range of a few thousand at most, because
otherwise the total number of rules generated would be
enormous and the purpose of mining would not be served.
Note that the above bound is very pessimistic. Typically, the
lengths of tidlists are much smaller than the partition size,
especially as the itemset length increases.</p>
      <p>Main memory consumed by the current partition is small
compared to the above two factors. E.g., if each transaction
occupies 1KB, a partition of size 20K would require only
20MB of memory. Even in these extreme examples, the
total memory consumption of ARMOR is 500MB, which is
acceptable on current machines.</p>
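The back-of-the-envelope estimates in this section can be worked through in a few lines of Python. The constants (about 10 integers per DAG candidate, 1 per array entry, 4-byte integers, 20K-tuple partitions) come from the text; the 5,000-itemset figure in the check is our own round example of "a few thousand" frequent itemsets:

```python
# Worked version of the memory estimates: counter space for the DAG and
# lookup arrays, plus a pessimistic bound on tid-list space. The
# per-candidate constants are the approximate figures from the text.

INT = 4  # assumed bytes per integer

def counter_space(dag_candidates, array_entries):
    # ~10 integers per DAG candidate, 1 integer per array entry
    return (10 * dag_candidates + array_entries) * INT

def tidlist_bound(partition_size, frequent_itemsets):
    # each tid-list holds at most one TID per partition tuple
    return partition_size * frequent_itemsets * INT
```

With 1 million DAG candidates and 10 million array entries, `counter_space` gives 80,000,000 bytes, matching the ~80MB figure above; a 20K-tuple partition with 5,000 frequent itemsets bounds tid-list space at 400,000,000 bytes, matching the ~400MB figure.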
      <p>Therefore, in general we do not expect memory to be an
issue for mining market-basket databases using ARMOR.
Further, even if it does happen to be an issue, it is easy to
modify ARMOR to free the space allocated to tidlists at the
expense of time: a tidlist can be freed after line 3 of the
Update function shown in Figure 4.</p>
      <p>A final observation is that the main memory
consumption of ARMOR is proportional to the size of the output
and does not “explode” as the input problem size increases.</p>
    </sec>
    <sec id="sec-6">
      <title>Experimental Results for ARMOR</title>
      <p>We evaluated the performance of ARMOR with respect
to Oracle on a variety of databases and support
characteristics. We now report on our experimental results for the same
performance model described in Section 3. Since Apriori,
FP-growth and VIPER have already been compared against
Oracle in Section 3.1, we do not repeat those observations
here, but focus on the performance of ARMOR. This improves
the visual clarity of the graphs. We hasten to add that
ARMOR does outperform the other algorithms.</p>
      <sec id="sec-6-1">
        <title>Experiment 3: Performance of ARMOR</title>
        <p>In this experiment, we evaluated the response time
performance of the ARMOR and Oracle algorithms for the
T10.I4, T20.I12 and T40.I8 databases, each containing 10M
transactions; the results are shown in Figures 9a–c.</p>
        <p>In these graphs, we first see that ARMOR’s performance
is close to that of Oracle for high supports. This is because
of the following reasons: the frequent itemset distribution is
sparse at high supports, resulting in only
a few frequent itemsets with supports close to minsup.
Hence, frequent itemsets are likely to be locally frequent
within most partitions. Even if they are not locally
frequent in a few partitions, it is very likely that they are still
d-frequent (frequent over the portion of the database processed so
far) over these partitions. Hence, their counters are
updated even over these partitions. Therefore, the complete
counts of most candidates would be available at the end of
the first pass, resulting in a “light and short” second pass.
Hence, it is expected that the performance of ARMOR will
be close to that of Oracle at high supports.</p>
        <p>Since the frequent itemset distribution becomes dense
at low supports, the above argument does not hold in this
support region. Hence we see that ARMOR’s performance
relative to Oracle decreases at low supports. But, what is
far more important is that ARMOR consistently performs
within a factor of two of Oracle. This is highlighted in
Table 3 where we show the ratios of the performance of
ARMOR to that of Oracle for the lowest support values
considered for each of the databases.
</p>
      </sec>
      <sec id="sec-6-2">
        <title>Experiment 4: Memory Utilization in ARMOR</title>
        <p>The previous experiments were conducted with the total
number of items set to 1K. In this experiment we
set the number of items to 20K for the T10.I4 database – this
environment represents an extremely stressful situation
for ARMOR with regard to memory utilization due to the
very large number of items. Figure 10 shows the memory
utilization of ARMOR as a function of support for the
1K-item and 20K-item cases. We see that the main memory
utilization of ARMOR scales well with the number of items.
For example, at the 0.1% support threshold, the memory
consumption of ARMOR for 1K items was 104MB
while for 20K items it was 143MB – an increase of less
than 38% for a 20 times increase in the number of items!
The reason is that the main memory utilization of
ARMOR does not depend directly on the number of items,
but only on the size of the mining output, as discussed in
Section 5.</p>
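        <p>The scaling figure quoted above follows directly from the two reported measurements:</p>
        <preformat>
```python
mem_1k, mem_20k = 104, 143  # MB, measured at the 0.1% support threshold
increase = (mem_20k - mem_1k) / mem_1k
print(f"{increase:.1%} more memory for a 20x increase in items")  # 37.5%
```
        </preformat>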
      </sec>
      <sec id="sec-6-4">
        <title>Experiment 5: Real Datasets</title>
        <p>
          Despite repeated efforts, we were unable to obtain large
real datasets that conform to the sparse nature of market
basket data, since such data is not publicly available for
proprietary reasons. The datasets in the UC Irvine public
domain repository [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] which are commonly used in data
mining studies were not suitable for our purpose since they
are dense and have long patterns. We were, however, able to
obtain two datasets – BMS-WebView-1, a clickstream dataset
from Blue Martini Software [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and EachMovie, a movie
database from Compaq Equipment Corporation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which
we transformed to the format of boolean market basket data.
The resulting databases had 59,602 and 61,202 transactions,
respectively, with 870 and 1648 distinct items.
        </p>
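        <p>A transformation to boolean market-basket format along these lines can be sketched as follows; the record layout and function name are hypothetical, as the paper does not give its preprocessing details:</p>
        <preformat>
```python
def to_market_basket(records):
    """Group (user, item) interaction records into one transaction per user,
    keeping only the distinct items each user interacted with."""
    baskets = {}
    for user, item in records:
        baskets.setdefault(user, set()).add(item)
    # one sorted itemset per user, in user order
    return [sorted(items) for _, items in sorted(baskets.items())]

records = [(1, "m1"), (1, "m2"), (2, "m1")]
print(to_market_basket(records))  # [['m1', 'm2'], ['m1']]
```
        </preformat>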
        <p>We set the rule support threshold values for the
BMS-WebView-1 and EachMovie databases to the ranges
(0.06%–0.1%) and (3%–10%), respectively. The results of
these experiments are shown in Figures 11a–b. We see
in these graphs that the performance of ARMOR
continues to be within twice that of Oracle. The ratio of
ARMOR’s performance to that of Oracle at the lowest support
value of 0.06% for the BMS-WebView-1 database was 1.83
whereas at the lowest support value of 3% for the
EachMovie database it was 1.73.
</p>
      </sec>
      <sec id="sec-6-5">
        <title>Discussion of Experimental Results</title>
        <p>We now explain why ARMOR typically
performs within a factor of two of Oracle. First, we
notice that the only difference between the single pass of
Oracle and the first pass of ARMOR is that ARMOR
continuously generates and removes candidates. Since the
generation and removal of candidates in ARMOR is dynamic
and efficient, this does not result in a significant additional
cost for ARMOR.</p>
        <p>Since candidates in ARMOR that are neither d-frequent
(frequent over the portion of the database processed so far) nor
part of the current negative border are continuously
removed, any itemset that is locally frequent within a
partition, but not globally frequent in the entire database, is likely
to be removed from the candidate set during the course of the first pass
(unless it belongs to the current negative border). Hence the
resulting candidate set in ARMOR is a good approximation
of the required mining output. In fact, in our experiments,
we found that in the worst case, the number of candidates
counted in ARMOR was only about ten percent more than
the required mining output.</p>
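        <p>The pruning rule described above can be sketched as follows; this is a simplified illustration (ARMOR's actual DAG-based bookkeeping is more involved), where a candidate survives only if it is frequent over the database processed so far or lies on the current negative border:</p>
        <preformat>
```python
from itertools import combinations

def prune_candidates(counts, db_size_so_far, minsup):
    """Keep candidates that are frequent so far, or lie on the current
    negative border (all immediate subsets frequent so far)."""
    threshold = minsup * db_size_so_far
    frequent = {i for i, c in counts.items() if c >= threshold}
    kept = set(frequent)
    for itemset in counts:
        if itemset in frequent:
            continue
        subsets = [tuple(s) for s in combinations(itemset, len(itemset) - 1)]
        if all(s in frequent for s in subsets if s):  # negative-border check
            kept.add(itemset)
    return kept

counts = {("a",): 8, ("b",): 2, ("a", "b"): 1}
# ('a',) is frequent, ('b',) sits on the negative border, ('a','b') is pruned
print(sorted(prune_candidates(counts, db_size_so_far=10, minsup=0.5)))
```
        </preformat>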
        <p>The above two reasons indicate that the cost of the first
pass of ARMOR is only slightly more than that of (the
single pass in) Oracle.</p>
        <p>Next, we notice that the only difference between the
second pass of ARMOR and (the single pass in) Oracle is that
in ARMOR, candidates are continuously removed. Hence
the number of itemsets being counted in ARMOR during
the second pass quickly reduces to much less than that of
Oracle. Moreover, ARMOR does not necessarily perform
a complete scan over the database during the second pass
since the second pass ends when there are no more
candidates. For these reasons, we would expect that the cost
of the second pass of ARMOR is usually less than that of
(the single pass in) Oracle.</p>
        <p>Since the cost of the first pass of ARMOR is usually only
slightly more than that of (the single pass in) Oracle and that
of the second pass is usually less than that of (the single pass
in) Oracle, it follows that ARMOR will typically perform
within a factor of two of Oracle.</p>
        <p>In summary, due to the above reasons, it appears unlikely
that algorithms can be designed that substantially
reduce either the number of database passes or the number
of candidates counted; these represent the primary
bottlenecks in association rule mining. Further, since ARMOR
utilizes the same itemset counting technique as Oracle,
further overall improvement without domain knowledge seems
extremely difficult. Finally, even though we have not proved the
optimality of Oracle with respect to tidlist intersection, we
note that any smart intersection techniques that may be
implemented in Oracle can also be used in ARMOR.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>A variety of novel algorithms have been proposed in the
recent past for the efficient mining of association rules, each
in turn claiming to outperform its predecessors on a set of
standard databases. In this paper, our approach was to
quantify the algorithmic performance of association rule mining
algorithms with regard to an idealized, but practically
infeasible, “Oracle”. The Oracle algorithm utilizes a
partitioning strategy to determine the supports of itemsets in the
required output. It uses direct lookup arrays for counting
singletons and pairs and a DAG data-structure for
counting longer itemsets. We have shown that these choices are
optimal in that only required itemsets are enumerated and
that the cost of enumerating each itemset is Θ(1). Our
experimental results showed that there was a substantial gap
between the performance of current mining algorithms and
that of the Oracle.</p>
      <p>We also presented a new online mining algorithm, called
ARMOR (Association Rule Mining based on ORacle),
constructed by making minimal changes to Oracle.
ARMOR utilizes a new method of
candidate generation that is dynamic and incremental and is
guaranteed to complete in two passes over the database. Our
experimental results demonstrate that ARMOR performs
within a factor of two of Oracle.</p>
      <p>Acknowledgments This work was partially supported by
a Swarnajayanti Fellowship from the Dept. of Science and
Technology, Govt. of India.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>EachMovie collaborative filtering data set</article-title>
          . http://www.research.compaq.com/SRC/eachmovie/,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Fast algorithms for mining association rules</article-title>
          .
          <source>In Proc. of Intl. Conf. on Very Large Databases (VLDB)</source>
          ,
          Sept.
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Blake</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Merz</surname>
          </string-name>
          .
          <article-title>UCI repository of machine learning databases</article-title>
          . http://www.ics.uci.edu/~mlearn/MLRepository.html,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation</article-title>
          .
          <source>In Proc. of ACM SIGMOD Intl. Conf. on Management of Data</source>
          , May
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hidber</surname>
          </string-name>
          .
          <article-title>Online association rule mining</article-title>
          .
          <source>In Proc. of ACM SIGMOD Intl. Conf. on Management of Data</source>
          ,
          June
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Dunham</surname>
          </string-name>
          .
          <article-title>Mining association rules: Antiskew algorithms</article-title>
          .
          <source>In Proc. of Intl. Conf. on Data Engineering (ICDE)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pudi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          .
          <article-title>Quantifying the utility of the past in mining large databases</article-title>
          .
          <source>Information Systems</source>
          ,
          July
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pudi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          .
          <article-title>On the optimality of association-rule mining algorithms</article-title>
          .
          <source>Technical Report TR-2001-01</source>
          , DSL, Indian Institute of Science,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Savasere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Omiecinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Navathe</surname>
          </string-name>
          .
          <article-title>An efficient algorithm for mining association rules in large databases</article-title>
          .
          <source>In Proc. of Intl. Conf. on Very Large Databases (VLDB)</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sudarshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bhalotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bawa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Turbo-charging vertical mining of large databases</article-title>
          .
          <source>In Proc. of ACM SIGMOD Intl. Conf. on Management of Data</source>
          , May
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Mining generalized association rules</article-title>
          .
          <source>In Proc. of Intl. Conf. on Very Large Databases (VLDB)</source>
          ,
          Sept.
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Dunham</surname>
          </string-name>
          .
          <article-title>Considering main memory in mining association rules</article-title>
          .
          <source>In Proc. of Intl. Conf. on Data Warehousing and Knowledge Discovery (DAWAK)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Gouda</surname>
          </string-name>
          .
          <article-title>Fast vertical mining using diffsets</article-title>
          .
          <source>Technical Report 01-1</source>
          , Rensselaer Polytechnic Institute,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Mason</surname>
          </string-name>
          .
          <article-title>Real world performance of association rule algorithms</article-title>
          .
          <source>In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining (KDD)</source>
          ,
          Aug.
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>