<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reducing the Main Memory Consumptions of FPmax* and FPclose</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>G o ̈sta Grahne and Jianfei Zhu Concordia University Montreal</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2000</year>
      </pub-date>
      <abstract>
        <p>2. Improving the Main Memory Requirements In [4], we gave FPgrowth*, FPmax* and FPclose for mining all, maximal and closed frequent itemsets, respectively. In this short paper, we describe two approaches for improving the main memory consumptions of FPmax* and FPclose. Experimental results show that the two approaches successfully reduce the main memory requirements of the two algorithms, and that in particular one of the approaches does not incur any practically significant extra running time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In FIMI’03 [2], many implementations of algorithms for
mining all, maximal and closed frequent itemsets were
submitted and tested independently by the organizers. The
experimental results in [3] showed that our algorithms
FPgrowth*, FPmax* and FPclose [4] have great performance
on most datasets, and that FPmax* and FPclose were among
the fastest implementations. Our experimental results in [7]
also showed that the three algorithms are among the
algorithms that consume the least amount of main memory when
running them on dense datasets.</p>
      <p>However, we also found in [7] that FPmax* and FPclose
require much more main memory than other algorithms in
[2] especially when the datasets are sparse. This is because
the FP-trees constructed from the sparse datasets sometimes
are fairly big, and they are stored in main memory for the
entire execution of the algorithms FPmax* and FPclose.
The sizes of some auxiliary data structures for storing
maximal and closed frequent itemsets, the MFI-trees and the
CFI-trees also always increase even though many nodes in
the trees become useless.</p>
      <p>In this short paper, we describe two approaches for
reducing the main memory usages of FPmax* and FPclose.
We also give the experimental results which show that
FPmax* with either of the approaches needs less main memory
for running on both synthetic dataset and real dataset.</p>
      <p>We first give a detailed introduction to the main memory
requirements of the two algorithm implementations in [4].
Then two approaches for improving the main memory
consumption are introduced. Since FPmax* and FPclose have
similar main memory requirements, here we only consider
main memory improvements in the implementation of
FPmax*.</p>
    </sec>
    <sec id="sec-2">
      <title>The Basic Case</title>
      <p>In [4], when implementing FPgrowth*, FPmax* and
FPclose, each node P of an FP-tree, an MFI-tree or a CFI-tree
has 4 pointers that point to its parent node, left-child node,
right-sibling node and the next node that corresponds to the
same itemname as P . The left-child node and right-sibling
node pointers are set for the tree construction. The parent
node pointer and next node pointer in an FP-tree are used
for finding the conditional pattern base of an item. In
MFItrees and CFI-trees, they are used for maximality checking
and closedness testing.</p>
      <p>In all algorithms, the FP-tree T∅ constructed from the
original database D is always stored in main memory
during the execution of the algorithms. For FPmax*, during
the recursive calls, many small FP-trees and MFI-trees will
be constructed. The biggest MFI-tree is M∅ whose size
increases slowly. At the end of the call of FPmax*(T∅, M ),
∅
M∅ stores all maximal frequent itemsets mined from D.</p>
      <p>We can see that the main memory requirement of
FPmax* in the basic case is at least the size of T∅ plus the size
of M∅ which contains all maximal frequent itemsets in D.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach 1: Trimming the FP-trees and MFI-trees</title>
    </sec>
    <sec id="sec-4">
      <title>Continuously</title>
      <p>To see if we can reduce the main memory requirement
of FPmax*, let’s analyze FPmax* first.</p>
      <p>Suppose during the execution of FPmax*, an FP-tree T
and its corresponding MFI-tree M are constructed. The
items in T and M are i1, i2, . . . , in in decreasing order of
their frequency. Note that the header tables of T and M
have the same items and item order. Starting from the least
frequent item in, FPmax* mines maximal frequent
itemsets from T . A candidate frequent itemset X is compared
with the maximal frequent itemsets in M . If X is
maximal, X is inserted into M . When processing the item
ik, FPmax* needs the frequency information that contains
only items i1, i2, . . . , ik−1, and the frequency information
of ik+1, . . . , in will not be used any more. In other words,
in T , only the nodes that correspond to i1, i2, . . . , ik are
useful, and the nodes corresponding to ik+1, . . . , in can be
deleted from T . If a candidate maximal frequent itemset
X is found, X must be a subset of i1, i2, . . . , ik. Thus
in M , only the nodes corresponding to i1, i2, . . . , ik are
used for maximality checking, and the nodes
corresponding to ik+1, . . . , in will never be used, and therefore can be
deleted.</p>
      <p>Based on the above analysis, we can reduce the main
memory requirement of FPmax* by continuously trimming
the FP-trees and MFI-trees. After processing an item ik, all
ik-nodes in T and M are deleted. This can be done by
following the head of the link list from ik in the header tables
T.header and M.header. Remember that the children of a
node are organized by a right-sibling linked list. To speed
up the deletions we make this list doubly linked, i.e. each
node has pointers both to its right and left siblings.</p>
      <p>Before calling FPmax*, T∅ has to be stored in the main
memory. By deleting all nodes that will not be used any
more, the sizes of FP-trees, especially the size of T∅,
become smaller and smaller. The sizes of the MFI-trees still
increase because new nodes for new maximal frequent
itemsets are inserted, however, since obsolete nodes are also
deleted, the MFI-trees will grow more slowly. At the end
of the call of FPmax*, the sizes of T∅ and M∅ are all zero.
We assume that the sizes of the recursively constructed
FPtrees and MFI-trees are always far smaller than the size of
the top-level trees T∅ and M∅, and that the main memory
consumption of these trees can be neglected. Besides T∅,
the main memory also stores M . At the initial call of
FP∅
max*, the size of M∅ is zero. Then M∅ never reaches its
full size because of the trimming. We estimate that the
average main memory requirement of FPmax* with approach
1 is the size of T∅ plus half of the size of M .
∅</p>
      <p>In [4], we mentioned that we can allocate a chunk of
main memory for an FP-tree, and delete all nodes in the
FP-tree at a time by deleting the chunk. Time is saved by
avoiding deleting the nodes in the FP-tree one by one.
Obviously, this technique can not be used parallel with approach
1. Therefore, FPmax* with approach 1 will be slower than
the basic FPmax*, but its peak main memory requirement
will be smaller than that of the basic FPmax*.</p>
    </sec>
    <sec id="sec-5">
      <title>Approach 2: Trimming the FP-trees and MFI-trees</title>
    </sec>
    <sec id="sec-6">
      <title>Once</title>
      <p>In approach 2, we use the main memory management
technique by trimming the FP-trees and MFI-trees only
once. We still assume that main memory consumption of
the recursively constructed FP-trees and MFI-trees can be
neglected, and only the FP-tree T∅ and the MFI-tree M∅ are
trimmed.</p>
      <p>Suppose the items in T∅ and M∅ are i1, i2, . . . , in. In our
implementation, we allocate a chunk of main memory for
those nodes in T∅ and M∅ that correspond to ibn/2c, . . . , in.
The size of the chunk is changeable. During the execution
of FPmax*, T∅ and M∅ are not trimmed until item ibn/2c in
T∅.header is processed. The main memory of the chunk is
freed and all notes in the chunk are deleted at that time.</p>
      <p>In this approach, before processing ibn/2c and freeing the
chunk, T∅ and a partial M∅ are stored in the main memory.
On the average, the size of M∅ is half of the size of the full
M . After freeing the chunk, new nodes for new maximal
∅
frequent itemsets are inserted and they are never trimmed.
However, considering the fact that MFI-tree structure is a
compact data structure, the new nodes are for the bn/2c
most frequent items, and M∅ already has many branches for
those nodes before trimming, we can expect that the size of
M∅ will be a little bit more than half of the size of the
complete M∅. Therefore the peak main memory consumption is
a little bit more than the size of T∅ plus half of the size of
M∅. Compared with approach 1, the FPmax* with approach
2 is faster but consumes somewhat more main memory.</p>
    </sec>
    <sec id="sec-7">
      <title>3. Experimental Evaluation</title>
      <p>We now present a comparison of the runtime and main
memory consumptions of the basic case and the two
approaches. We ran the three implementations of FPmax* on
many synthetic and real datasets. The synthetic datasets are
sparse datasets, and the real datasets are all dense. Due to
the lack of space, only the results for one synthetic dataset
and one real dataset are shown here.</p>
      <p>The synthetic dataset T20I10D200K was generated from
the application on the website [1]. It contains 200,000
transactions and 1000 items. The real dataset pumsb* was
downloaded from the FIMI’03 website [2]. It was produced from
census data of Public Use Microdata Sample (PUMS).</p>
      <p>All experiments were performed on a 1GHz Pentium III
with 512 MB of memory running RedHat Linux 7.3.</p>
      <p>Figure 1 shows the runtime and the main memory usage
of running FPmax* with the implementations of the basic
case and the two approaches on the dataset T20I10D200K.
As expected, in the runtime graph, FPmax* with approach
1 took the longest time. Its runtime is almost twice the
runtime of the basic case and approach 2. However, approach 1
Main Memory Consumption
1 1
100
)(
s
item10
unR
0 1
100
10
Basic
Approach1
Approach2</p>
      <p>Pumsb*
160 50 40 30 20 10 0 1</p>
      <p>Minimum Support(%)
Runtime</p>
      <p>Pumsb*
consumes the least amount of main memory. The peak main
memory of approach 1 is always less than the basic case for
about 10 megabytes, or about 15%. The speed of approach
2 is similar to that of the basic case, since approach 2 only
trims the FP-tree T∅ and the MFI-tree M∅ once. The main
memory consumption of approach 2 is similar to that of
approach 1, which means the approach 2 successfully saves
main memory.</p>
      <p>The runtime and main memory usage of running
FPmax* on real dataset pumsb* are shown in Figure 2. The
results are similar to those results on synthetic dataset.
Dataset pumsb* is a very dense dataset, its FP-trees and
MFI-trees have very good compactness, and there are not
many nodes in the trees. Therefore, in the two graphs in
Figure 2, the differences of the runtime and main memory
consumptions for the basic case and the two approaches are
not very big.</p>
    </sec>
    <sec id="sec-8">
      <title>4. Conclusions</title>
      <p>We have analyzed the main memory requirements of
the FPmax* and FPclose implementation in [4]. Two
approaches for reducing the main memory requirements of
FPmax* and FPclose are introduced. Experimental results
show that both approach 1 and approach 2 successfully
decrease the main memory requirement of FPmax*. While the
continuous trimming of the trees in approach 1 slows down
the algorithm, the “one-time-trimming” used in approach 2
shows speed similar to the original method.</p>
      <p>We also noticed that the PatriciaMine in [6] using
Patricia trie structure to implement the FP-growth method [5]
shows great speed and less main memory requirement. We
are currently considering implementing FPmax* and
FPclose using a Patrica trie.
[1] http://www.almaden.ibm.com/software
/quest/Resources/index.shtml.
[2] http://fimi.cs.helsinki.fi.
[3] B. Goethals and M. J. Zaki (Eds.).
Proceedings of the First IEEE ICDM Workshop on
Frequent Itemset Mining Implementations (FIMI
’03). CEUR Workshop Proceedings, Vol 80
http://CEUR-WS.org/Vol-90.
[4] G. Grahne and J. Zhu. Efficiently using
prefixtrees in mining frequent itemsets. In
Proceedings of the 1st IEEE ICDM Workshop on Frequent
Itemset Mining Implementations (FIMI’03),
Melbourne, FL, Nov. 2003.
[6] A. Pietracaprina and D. Zandolin. Mining
frequent itemsets using Patricia tries. In Proceedings
of the 1st Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov.
2003.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>