nonordfp: An FP-Growth Variation without Rebuilding the FP-Tree

Balázs Rácz
Computer and Automation Research Institute of the Hungarian Academy of Sciences
H-1111 Budapest, Lágymanyosi u. 11.
bracz+fm4@math.bme.hu

Abstract

We describe a frequent itemset mining algorithm and implementation based on the well-known algorithm FP-growth. The theoretical difference is the main data structure (tree), which is more compact and which we do not need to rebuild for each conditional step. We deal thoroughly with implementation issues: the data structures, memory layout, I/O, and library functions we use to achieve performance comparable to the best implementations of the 1st Frequent Itemset Mining Implementations (FIMI) Workshop.

1. Introduction

Frequent Itemset Mining (FIM) is one of the first and thus most well-studied problems of data mining. Of the many published algorithms for this task, pattern growth approaches (FP-growth and its variations) were among the best performing ones.
This paper describes an implementation of a pattern growth algorithm. We assume the reader is familiar with the problem of frequent itemset mining [2] and with pattern growth algorithms [5] such as FP-growth, and hence we omit their description here. For the reasons and goals of analyzing implementation issues of the FIM problem, see the introduction to the 1st FIMI workshop [3].

Our implementation is based on a variation of the FP-tree, a data structure similar to the one used by FP-growth, but with a more compact representation that allows faster allocation, traversal, and, optionally, projection. It maintains less administrative information (the nodes do not need to store their labels (item identifiers), and no header lists or children mappings are required, only counters and parent pointers), and it allows more recursive steps to be carried out on the same data structure, without the need to rebuild it. There are also drawbacks to never rebuilding the tree: though projection makes it possible to filter out conditionally infrequent items, the order of items cannot be changed to adapt to the conditional frequencies. Hence the acronym of the algorithm: nonordfp.

We describe implementational details, data structure traversal routines, the memory allocation scheme, library functions and I/O acceleration, along with the algorithmic parameters of our implementation that control the different traversal functions and projection. The implementation is freely available for research purposes, aimed not only at performance comparison but also at further tuning of these theoretical parameters.

2. Overview of the algorithm and data structures

As a preprocessing step the database is scanned and the global frequencies of the items are counted. Using the minimum support, infrequent items are erased and frequent items are renumbered in frequency-descending order. During a second scan of the database all transactions are preprocessed: infrequent items are erased, and frequent items are translated and sorted according to the new numbering. Then the itemset is inserted into a temporary trie.

This trie is similar to the classic FP-tree: each node contains an item identifier, a counter, a parent pointer and a children map. The children map is an unordered array of (child item identifier, child node index) pairs, and lookup is done with a linear scan. Though this is asymptotically not an optimal structure, the number of elements in a single children map is expected to be very small, and a linear scan has the least overhead compared to ordered arrays with binary search, or to search trees and hash maps. The implementation uses an interface equivalent to the Standard Template Library (STL) pair-associative container, so it is easy to exchange this structure for the RB-tree based STL map or for a hash map; doing so results in a slight performance decrease due to data structure overhead.
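As a minimal sketch (our own illustration with invented type and member names, not the paper's actual code), the temporary trie with a linear-scan children map could look like this:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the temporary trie: each node stores its item
// label, a counter, a parent index, and an unsorted children map of
// (child item, child node index) pairs searched by linear scan. The
// vector-of-pairs mimics the STL pair-associative interface, so it
// could be exchanged for std::map<uint32_t, uint32_t>.
struct TrieNode {
    uint32_t item;                                        // item identifier (label)
    uint32_t counter;                                     // transaction count
    uint32_t parent;                                      // index of the parent node
    std::vector<std::pair<uint32_t, uint32_t>> children;  // (item, node index)
};

struct TempTrie {
    std::vector<TrieNode> nodes{TrieNode{0, 0, 0, {}}};  // node 0 is the root

    // Insert one preprocessed (filtered, renumbered, sorted) transaction.
    void insert(const std::vector<uint32_t>& transaction) {
        uint32_t cur = 0;
        for (uint32_t item : transaction) {
            uint32_t next = UINT32_MAX;
            for (const auto& c : nodes[cur].children)  // linear-scan lookup
                if (c.first == item) { next = c.second; break; }
            if (next == UINT32_MAX) {  // no child with this label yet
                next = static_cast<uint32_t>(nodes.size());
                nodes.push_back(TrieNode{item, 0, cur, {}});
                nodes[cur].children.emplace_back(item, next);
            }
            cur = next;
            ++nodes[cur].counter;
        }
    }
};
```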
As a final step of the preprocessing phase this trie is copied into the data structure that the core of the algorithm will use, which we describe later.

The core algorithm consists of a recursion. In each step the input is a condition (an itemset), a trie structure and an array of counters that describe the conditional frequencies of the trie nodes. In the body we iterate through the remaining items, calculate the conditional counters for the input condition extended with that single item, and call the recursion with the new counters and with either the original or a new, projected structure, depending on the projection configuration and the percentage of empty nodes. The core recursion is shown as Algorithm 1.

Algorithm 1 Core algorithm
Recursion(condition, nextitem, structure, counters):
for citem = nextitem-1 downto 0 do
  if support of citem < min_supp then
    continue at next citem
  end if
  newcounters = aggregate conditional pattern base for condition ∪ citem
  if projection is beneficial then
    newstructure = projection of structure to newcounters
    Recursion(condition ∪ citem, citem, newstructure, newcounters)
  else
    Recursion(condition ∪ citem, citem, structure, newcounters)
  end if
end for
The recursion has four different implementations that suit differently sized FP-trees:

• Very large FP-trees that contain millions of nodes are treated by simultaneous projection: the tree is traversed once and a projection to each item is calculated simultaneously. This phase is applied only at the first level of recursion; very large trees are expected to arise from sparse databases, like real market basket data, and conditional trees projected to a single item are already small in this case.

• Sparse aggregate is an aggregation and projection algorithm that does not traverse those parts of the tree that will not exist in the next projection. To achieve this, a linked list is built dynamically that contains the indices of the non-zero counters. This is similar to the header lists of FP-trees. This aggregation algorithm is typically used near the top of the recursion, where the tree is large and many zeroes are expected. The exact choice is tunable with parameters.

• Dense aggregate is the default aggregation algorithm. Each node of the tree is visited exactly once and its conditional counter is added to the counter of its parent. It is very fast due to the memory layout of the data structure, described later.

• Single node optimization is used near the last levels of the recursion, when at most one node per item is left in the tree. (This is a slight generalization of the tree being a single chain.) In this case no aggregation or calculation of new counters is needed, so a specialized, very simple recursive procedure starts that outputs all subsets of the paths in the tree as frequent itemsets.

The core data structure is a trie. Each node contains a counter and a pointer to its parent. As the trie is never searched, only traversed from the bottom to the top, child maps are not required. The nodes are stored in an array, and node pointers are indices into this array. Nodes that are labelled with the same item occupy a consecutive part of this array, so we do not need to store the item identifiers in the nodes. Furthermore, we do not need the header lists, as processing all nodes of a given item only requires traversing an interval of this array. This also allows faster execution, as only contiguous memory reads are executed. We only need one memory cell per frequent item to store the starting points of these intervals (the itemstarts array).

The parent pointers (indices) and the counters are stored in separate arrays (parents and counters, respectively) to suit the core algorithm's flexibility: if projection is not beneficial, the recursion proceeds with the same structural information (the parent pointers) but a new set of counters.

The item intervals of the trie are allocated in the array in ascending, topological order. This way the bottom-up and top-down traversals of the trie are possible with a descending resp. ascending iteration through the array, still using only contiguous memory reads and writes.
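To make the layout concrete, here is a hypothetical toy example of our own (not from the paper): the trie of the transactions {0,1}, {0,2}, {0,1,2}, {1,2} over three frequent items, encoded in the three arrays. A node's label is not stored; it is recovered from the itemstarts intervals, and the topological order guarantees parents[n] < n, so a single descending (resp. ascending) index loop is a bottom-up (resp. top-down) traversal:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy trie for the transactions {0,1}, {0,2}, {0,1,2}, {1,2}
// (items already renumbered in frequency-descending order).
// Node indices: 0 = root; 1 = "0"; 2 = "0 1"; 3 = "1";
//               4 = "0 1 2"; 5 = "0 2"; 6 = "1 2".
const std::vector<uint32_t> parents   {0, 0, 1, 0, 2, 1, 3};
const std::vector<uint32_t> counters  {0, 3, 2, 1, 1, 1, 1};
// item 0 -> nodes [1,2), item 1 -> nodes [2,4), item 2 -> nodes [4,7)
const std::vector<uint32_t> itemstarts{1, 2, 4, 7};

// A node's label is the index of the itemstarts interval containing it.
uint32_t item_of(uint32_t n) {
    uint32_t i = 0;
    while (itemstarts[i + 1] <= n) ++i;
    return i;
}
```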
This order also allows the truncation of the tree to a particular level/item: if the structure is not rebuilt but only a set of conditional counters is calculated for an item, then the recursion can proceed with a smaller newcounters array and the original parents and itemstarts arrays.

The pseudocode for conditional aggregation and projection is shown as Algorithms 2 and 3. Some details are not shown; for example, during the aggregation phase we calculate the expected size of the projected structure, to allow a decision about the projection benefits and to allocate the arrays for the projected structure.

Algorithm 2 Aggregation on the compact trie data structure
cpb-aggregate(item, parents, itemstarts, counters, newcounters, condfreqs):
Input: item is the identifier of the item to add to the current condition; parents and itemstarts describe the current structure of the tree; counters and newcounters hold the current and new conditional counters of the nodes: counters is an itemstarts[item+1] sized array, newcounters is an itemstarts[item] sized array; condfreqs will hold the new conditional frequencies of the items. This is the default (dense) aggregation algorithm.
fill newcounters and condfreqs with zeroes
for n = itemstarts[item] to itemstarts[item+1]-1 do
  newcounters[parents[n]] = counters[n]
end for
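Algorithm 2 is cut off in this excerpt, so the following is only a hedged, runnable reconstruction of the dense aggregation from the prose description ("each node of the tree is visited exactly once and its conditional counter is added to the counter of the parent"); the descending propagation pass and the condfreqs bookkeeping after the seeding loop are our own completion, and the function name follows the paper's cpb-aggregate:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hedged reconstruction of the dense aggregation (cpb-aggregate).
// parents/itemstarts: compact trie (node 0 is the root, nodes of item i
// occupy [itemstarts[i], itemstarts[i+1]), and parents[n] < n).
// counters: current conditional counters, size itemstarts[item+1].
// Returns newcounters (size itemstarts[item]); condfreqs receives the
// conditional frequencies of items 0..item-1.
std::vector<uint32_t> cpb_aggregate(uint32_t item,
                                    const std::vector<uint32_t>& parents,
                                    const std::vector<uint32_t>& itemstarts,
                                    const std::vector<uint32_t>& counters,
                                    std::vector<uint32_t>& condfreqs) {
    std::vector<uint32_t> newcounters(itemstarts[item], 0);
    condfreqs.assign(item, 0);
    // Seed: push the counters of the conditioning item's nodes to their parents.
    for (uint32_t n = itemstarts[item]; n < itemstarts[item + 1]; ++n)
        newcounters[parents[n]] += counters[n];
    // One descending pass visits every remaining node exactly once and adds
    // its conditional counter to its parent's; this suffices because the
    // node array is in topological order (every parent index is smaller).
    uint32_t citem = item;  // item interval of node n, found by walking down
    for (uint32_t n = itemstarts[item] - 1; n >= 1; --n) {
        while (itemstarts[citem] > n) --citem;
        condfreqs[citem] += newcounters[n];
        newcounters[parents[n]] += newcounters[n];
    }
    return newcounters;
}
```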
Algorithm 3 Projection on the compact trie data structure
project(item, parents, itemstarts, newcounters, condfreqs, newparents, newitemstarts, newnewcounters):
Input: newcounters and condfreqs as computed by the aggregation algorithm; newparents and newitemstarts will hold the projected structure; newnewcounters will hold the values of newcounters reordered accordingly. The array newcounters is reused during the algorithm to store the old position to new position mapping.
newcounters[0] = 0 /*node 0 is reserved for the root*/
nn = 1 /*the next free node index*/
for citem = 0 to item-1 do
  newitemstarts[citem] = nn
  for n = itemstarts[citem] to itemstarts[citem+1]-1 do
    if condfreqs[citem]
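Algorithm 3 is likewise truncated in this excerpt, so the following is only a simplified, hypothetical sketch of a projection under the stated interface (the dropped-node handling is our own assumption): nodes of conditionally infrequent items and nodes with zero conditional counters are dropped, survivors are renumbered densely, and each kept node is linked to its nearest kept ancestor. Unlike the paper, which reuses newcounters for the old-to-new mapping, we use a separate remap vector for clarity:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of projection; names follow the paper's Algorithm 3.
struct Projected {
    std::vector<uint32_t> newparents, newitemstarts, newnewcounters;
};

Projected project(uint32_t item, const std::vector<uint32_t>& parents,
                  const std::vector<uint32_t>& itemstarts,
                  const std::vector<uint32_t>& newcounters,
                  const std::vector<uint32_t>& condfreqs, uint32_t minsupp) {
    Projected p;
    std::vector<uint32_t> remap(itemstarts[item], 0);  // old index -> new index
    p.newparents.push_back(0);                 // node 0 stays the root
    p.newnewcounters.push_back(newcounters[0]);
    uint32_t nn = 1;                           // next free node index
    for (uint32_t citem = 0; citem < item; ++citem) {
        p.newitemstarts.push_back(nn);
        for (uint32_t n = itemstarts[citem]; n < itemstarts[citem + 1]; ++n) {
            if (condfreqs[citem] < minsupp || newcounters[n] == 0) {
                remap[n] = remap[parents[n]];  // bridge over the dropped node
                continue;
            }
            remap[n] = nn++;
            p.newparents.push_back(remap[parents[n]]);  // parent already remapped
            p.newnewcounters.push_back(newcounters[n]);
        }
    }
    p.newitemstarts.push_back(nn);
    return p;
}
```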