nonordfp: An FP-Growth Variation without Rebuilding the FP-Tree

Balázs Rácz
Computer and Automation Research Institute of the Hungarian Academy of Sciences
H-1111 Budapest, Lágymanyosi u. 11.
bracz+fm4@math.bme.hu

Abstract

We describe a frequent itemset mining algorithm and implementation based on the well-known algorithm FP-growth. The theoretical difference is the main data structure (tree), which is more compact and which we do not need to rebuild for each conditional step. We deal thoroughly with implementation issues: the data structures, memory layout, I/O, and library functions we use to achieve performance comparable to the best implementations of the 1st Frequent Itemset Mining Implementations (FIMI) Workshop.

1. Introduction

Frequent Itemset Mining (FIM) is one of the first and thus most well-studied problems of data mining. Of the many published algorithms for this task, pattern growth approaches (FP-growth and its variations) were among the best performing ones.
This paper describes an implementation of a pattern growth algorithm. We assume the reader is familiar with the problem of frequent itemset mining [2] and with pattern growth algorithms [5] such as FP-growth, and hence we omit their description here. For the reasons and goals of analyzing implementation issues of the FIM problem, see the introduction to the 1st FIMI workshop [3].

Our implementation is based on a variation of the FP-tree, a data structure similar to the one used by FP-growth, but with a more compact representation that allows faster allocation, traversal, and, optionally, projection. It maintains less administrative information (the nodes do not need to store their labels (item identifiers), and no header lists or children mappings are required, only counters and parent pointers), and it allows more recursive steps to be carried out on the same data structure, without the need to rebuild it. There are also drawbacks to never rebuilding the tree: though projection makes it possible to filter out conditionally infrequent items, the order of items cannot be changed to adapt to the conditional frequencies. Hence the acronym of the algorithm: nonordfp.

We describe implementational details, data structure traversal routines, the memory allocation scheme, library functions and I/O acceleration, along with the algorithmic parameters of our implementation that control the different traversal functions and projection. The implementation is freely available for research purposes, aimed not only at performance comparison but also at further tuning of these theoretical parameters.

2. Overview of the algorithm and data structures

As a preprocessing step the database is scanned and the global frequencies of the items are counted. Using the minimum support, infrequent items are erased and frequent items are renumbered in frequency-descending order. During a second scan of the database all transactions are preprocessed: infrequent items are erased, and frequent items are translated and sorted according to the new numbering. Then the itemset is inserted into a temporary trie.

This trie is similar to the classic FP-tree: each node contains an item identifier, a counter, a parent pointer and a children map. The children map is an unordered array of (child item identifier, child node index) pairs, and lookup is done with a linear scan. Though this is asymptotically not an optimal structure, the number of elements in a single children map is expected to be very small, and a linear scan has the least overhead compared to ordered arrays with binary search, or to search trees and hash maps. The implementation uses an interface equivalent to the Standard Template Library (STL) pair-associative container, so it is easy to exchange this structure for the RB-tree based STL map or for a hash map; doing so results in a slight performance decrease due to data structure overhead.
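As a minimal sketch (our own illustration with invented type and member names, not the paper's actual code), the temporary trie with a linear-scan children map could look like this:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the temporary trie: each node stores its item
// label, a counter, a parent index, and an unsorted children map of
// (child item, child node index) pairs searched by linear scan. The
// vector-of-pairs mimics the STL pair-associative interface, so it
// could be exchanged for std::map<uint32_t, uint32_t>.
struct TrieNode {
    uint32_t item;                                        // item identifier (label)
    uint32_t counter;                                     // transaction count
    uint32_t parent;                                      // index of the parent node
    std::vector<std::pair<uint32_t, uint32_t>> children;  // (item, node index)
};

struct TempTrie {
    std::vector<TrieNode> nodes{TrieNode{0, 0, 0, {}}};  // node 0 is the root

    // Insert one preprocessed (filtered, renumbered, sorted) transaction.
    void insert(const std::vector<uint32_t>& transaction) {
        uint32_t cur = 0;
        for (uint32_t item : transaction) {
            uint32_t next = UINT32_MAX;
            for (const auto& c : nodes[cur].children)  // linear-scan lookup
                if (c.first == item) { next = c.second; break; }
            if (next == UINT32_MAX) {  // no child with this label yet
                next = static_cast<uint32_t>(nodes.size());
                nodes.push_back(TrieNode{item, 0, cur, {}});
                nodes[cur].children.emplace_back(item, next);
            }
            cur = next;
            ++nodes[cur].counter;
        }
    }
};
```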
As a final step of the preprocessing phase this trie is copied into the data structure that the core of the algorithm will use, which we describe later.

The core algorithm consists of a recursion. In each step the input is a condition (an itemset), a trie structure and an array of counters that describe the conditional frequencies of the trie nodes. In the body we iterate through the remaining items, calculate the conditional counters for the input condition extended with that single item, and call the recursion with the new counters and with either the original or a new, projected structure, depending on the projection configuration and the percentage of empty nodes. The core recursion is shown as Algorithm 1.

Algorithm 1 Core algorithm
Recursion(condition, nextitem, structure, counters):
for citem = nextitem-1 downto 0 do
  if support of citem < min_supp then
    continue at next citem
  end if
  newcounters = aggregate conditional pattern base for condition ∪ citem
  if projection is beneficial then
    newstructure = projection of structure to newcounters
    Recursion(condition ∪ citem, citem, newstructure, newcounters)
  else
    Recursion(condition ∪ citem, citem, structure, newcounters)
  end if
end for
The recursion has four different implementations that suit differently sized FP-trees:

• Very large FP-trees that contain millions of nodes are treated by simultaneous projection: the tree is traversed once and a projection to each item is calculated simultaneously. This phase is applied only at the first level of recursion; very large trees are expected to arise from sparse databases, like real market basket data, and conditional trees projected to a single item are already small in this case.

• Sparse aggregate is an aggregation and projection algorithm that does not traverse those parts of the tree that will not exist in the next projection. To achieve this, a linked list is built dynamically that contains the indices of the non-zero counters. This is similar to the header lists of FP-trees. This aggregation algorithm is typically used near the top of the recursion, where the tree is large and many zeroes are expected. The exact choice is tunable with parameters.

• Dense aggregate is the default aggregation algorithm. Each node of the tree is visited exactly once and its conditional counter is added to the counter of its parent. It is very fast due to the memory layout of the data structure, described later.

• Single node optimization is used near the last levels of the recursion, when at most one node per item is left in the tree. (This is a slight generalization of the tree being a single chain.) In this case no aggregation or calculation of new counters is needed, so a specialized, very simple recursive procedure starts that outputs all subsets of the paths in the tree as frequent itemsets.

The core data structure is a trie. Each node contains a counter and a pointer to its parent. As the trie is never searched, only traversed from the bottom to the top, child maps are not required. The nodes are stored in an array, and node pointers are indices into this array. Nodes that are labelled with the same item occupy a consecutive part of this array, so we do not need to store the item identifiers in the nodes. Furthermore, we do not need the header lists, as processing all nodes of a given item only requires traversing an interval of this array. This also allows faster execution, as only contiguous memory reads are executed. We only need one memory cell per frequent item to store the starting points of these intervals (the itemstarts array).

The parent pointers (indices) and the counters are stored in separate arrays (parents and counters, respectively) to suit the core algorithm's flexibility: if projection is not beneficial, the recursion proceeds with the same structural information (the parent pointers) but a new set of counters.

The item intervals of the trie are allocated in the array in ascending, topological order. This way the bottom-up and top-down traversals of the trie are possible with a descending resp. ascending iteration through the array, still using only contiguous memory reads and writes.
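To make the layout concrete, here is a hypothetical toy example of our own (not from the paper): the trie of the transactions {0,1}, {0,2}, {0,1,2}, {1,2} over three frequent items, encoded in the three arrays. A node's label is not stored; it is recovered from the itemstarts intervals, and the topological order guarantees parents[n] < n, so a single descending (resp. ascending) index loop is a bottom-up (resp. top-down) traversal:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy trie for the transactions {0,1}, {0,2}, {0,1,2}, {1,2}
// (items already renumbered in frequency-descending order).
// Node indices: 0 = root; 1 = "0"; 2 = "0 1"; 3 = "1";
//               4 = "0 1 2"; 5 = "0 2"; 6 = "1 2".
const std::vector<uint32_t> parents   {0, 0, 1, 0, 2, 1, 3};
const std::vector<uint32_t> counters  {0, 3, 2, 1, 1, 1, 1};
// item 0 -> nodes [1,2), item 1 -> nodes [2,4), item 2 -> nodes [4,7)
const std::vector<uint32_t> itemstarts{1, 2, 4, 7};

// A node's label is the index of the itemstarts interval containing it.
uint32_t item_of(uint32_t n) {
    uint32_t i = 0;
    while (itemstarts[i + 1] <= n) ++i;
    return i;
}
```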
This order also allows the truncation of the tree to a particular level/item: if the structure is not rebuilt but only a set of conditional counters is calculated for an item, then the recursion can proceed with a smaller newcounters array and the original parents and itemstarts arrays.

The pseudocode for conditional aggregation and projection is shown as Algorithms 2 and 3. Some details are not shown; for example, during the aggregation phase we calculate the expected size of the projected structure, to allow a decision about the projection benefits and to allocate the arrays for the projected structure.

Algorithm 2 Aggregation on the compact trie data structure
cpb-aggregate(item, parents, itemstarts, counters, newcounters, condfreqs):
Input: item is the identifier of the item to add to the current condition; parents and itemstarts describe the current structure of the tree; counters and newcounters hold the current and new conditional counters of the nodes: counters is an itemstarts[item+1] sized array, newcounters is an itemstarts[item] sized array; condfreqs will hold the new conditional frequencies of the items. This is the default (dense) aggregation algorithm.
fill newcounters and condfreqs with zeroes
for n = itemstarts[item] to itemstarts[item+1]-1 do
  newcounters[parents[n]] = counters[n]
end for
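Algorithm 2 is cut off in this excerpt, so the following is only a hedged, runnable reconstruction of the dense aggregation from the prose description ("each node of the tree is visited exactly once and its conditional counter is added to the counter of the parent"); the descending propagation pass and the condfreqs bookkeeping after the seeding loop are our own completion, and the function name follows the paper's cpb-aggregate:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hedged reconstruction of the dense aggregation (cpb-aggregate).
// parents/itemstarts: compact trie (node 0 is the root, nodes of item i
// occupy [itemstarts[i], itemstarts[i+1]), and parents[n] < n).
// counters: current conditional counters, size itemstarts[item+1].
// Returns newcounters (size itemstarts[item]); condfreqs receives the
// conditional frequencies of items 0..item-1.
std::vector<uint32_t> cpb_aggregate(uint32_t item,
                                    const std::vector<uint32_t>& parents,
                                    const std::vector<uint32_t>& itemstarts,
                                    const std::vector<uint32_t>& counters,
                                    std::vector<uint32_t>& condfreqs) {
    std::vector<uint32_t> newcounters(itemstarts[item], 0);
    condfreqs.assign(item, 0);
    // Seed: push the counters of the conditioning item's nodes to their parents.
    for (uint32_t n = itemstarts[item]; n < itemstarts[item + 1]; ++n)
        newcounters[parents[n]] += counters[n];
    // One descending pass visits every remaining node exactly once and adds
    // its conditional counter to its parent's; this suffices because the
    // node array is in topological order (every parent index is smaller).
    uint32_t citem = item;  // item interval of node n, found by walking down
    for (uint32_t n = itemstarts[item] - 1; n >= 1; --n) {
        while (itemstarts[citem] > n) --citem;
        condfreqs[citem] += newcounters[n];
        newcounters[parents[n]] += newcounters[n];
    }
    return newcounters;
}
```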
Algorithm 3 Projection on the compact trie data structure
project(item, parents, itemstarts, newcounters, condfreqs, newparents, newitemstarts, newnewcounters):
Input: newcounters and condfreqs as computed by the aggregation algorithm; newparents and newitemstarts will hold the projected structure; newnewcounters will hold the values of newcounters reordered accordingly. The array newcounters is reused during the algorithm to store the old position to new position mapping.
newcounters[0] = 0 /*node 0 is reserved for the root*/
nn = 1 /*the next free node index*/
for citem = 0 to item-1 do
  newitemstarts[citem] = nn
  for n = itemstarts[citem] to itemstarts[citem+1]-1 do
    if condfreqs[citem]
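Algorithm 3 is likewise truncated in this excerpt, so the following is only a simplified, hypothetical sketch of a projection under the stated interface (the dropped-node handling is our own assumption): nodes of conditionally infrequent items and nodes with zero conditional counters are dropped, survivors are renumbered densely, and each kept node is linked to its nearest kept ancestor. Unlike the paper, which reuses newcounters for the old-to-new mapping, we use a separate remap vector for clarity:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of projection; names follow the paper's Algorithm 3.
struct Projected {
    std::vector<uint32_t> newparents, newitemstarts, newnewcounters;
};

Projected project(uint32_t item, const std::vector<uint32_t>& parents,
                  const std::vector<uint32_t>& itemstarts,
                  const std::vector<uint32_t>& newcounters,
                  const std::vector<uint32_t>& condfreqs, uint32_t minsupp) {
    Projected p;
    std::vector<uint32_t> remap(itemstarts[item], 0);  // old index -> new index
    p.newparents.push_back(0);                 // node 0 stays the root
    p.newnewcounters.push_back(newcounters[0]);
    uint32_t nn = 1;                           // next free node index
    for (uint32_t citem = 0; citem < item; ++citem) {
        p.newitemstarts.push_back(nn);
        for (uint32_t n = itemstarts[citem]; n < itemstarts[citem + 1]; ++n) {
            if (condfreqs[citem] < minsupp || newcounters[n] == 0) {
                remap[n] = remap[parents[n]];  // bridge over the dropped node
                continue;
            }
            remap[n] = nn++;
            p.newparents.push_back(remap[parents[n]]);  // parent already remapped
            p.newnewcounters.push_back(newcounters[n]);
        }
    }
    p.newitemstarts.push_back(nn);
    return p;
}
```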