A New Approach for Approximately Mining Frequent Itemsets

Timur Valiullin, Joshua Zhexue Huang, Jianfei Yin, and Dingming Wu

Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Abstract. Mining frequent itemsets in transaction databases is an important task in many applications. This task becomes challenging when dealing with a very large transaction database because traditional algorithms are not scalable due to the memory limit. In this paper, we propose a new approach for approximately mining frequent itemsets in a transaction database. First, we partition the set of transactions in the database into disjoint subsets such that the distribution of frequent itemsets in each subset is similar to that of the entire database. Then, we randomly select a set of subsets and independently mine the frequent itemsets in each of them. After that, each frequent itemset discovered from these subsets is voted on, and an itemset appearing in the majority of the subsets is determined to be a frequent itemset, called a popular frequent itemset. All popular frequent itemsets are compared with the frequent itemsets discovered directly from the entire database using the same frequency threshold. The recalls and precisions of the frequent itemsets from the selected subsets are analyzed against the entire database. The experimental results demonstrate that using less than 10 percent of the transaction data in the database can achieve more than 87 percent accuracy. The new approach is well suited to parallel implementation for mining large transaction databases.

Keywords: Approximate Frequent Itemsets Mining, Random Sample, Partition

1. Introduction

Frequent itemset mining is the first and most critical stage of finding association rules in a transaction database. Association rule mining is one of the main data mining tasks in many applications, such as basket analysis, product recommendation, cross-selling, and inventory control. Huge research efforts have been devoted to the frequent itemset mining problem. Many of these works had considerable impact and led to plenty of sophisticated and efficient algorithms for association rule mining, such as Apriori [1, 2], FP-Growth (Frequent Pattern Growth) [3–6], Eclat [7–9], and others. However, the fast development of e-commerce and online and offline shopping over the past decade has resulted in rapid growth of transaction data, which presents a tremendous challenge to these existing algorithms, because they require a large amount of memory to run efficiently on large transaction databases.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Parallel and distributed association rule mining algorithms were developed to handle large transaction databases. Parallel association rule mining algorithms use in-memory computing to efficiently mine association rules from a large transaction database, but their scalability is limited by the memory size of the parallel system. Distributed association rule mining algorithms [10, 11] were developed using MapReduce [12] and run on Hadoop cluster platforms. These algorithms have better scalability, but they are not efficient in mining large transaction datasets because of frequent I/O operations and communication overhead between nodes.

In this paper, we propose a new approach for mining frequent itemsets from a big transaction dataset.
Similar to the distributed algorithms in MapReduce, we partition the dataset into disjoint subsets of the same size. However, we make the distribution of frequent itemsets in each subset similar to the distribution of frequent itemsets in the entire dataset. Then, we randomly select a set of subsets and run a frequent itemset mining algorithm independently on each to find its local frequent itemsets. After all frequent itemsets are discovered from the set of subsets, each frequent itemset is voted on by all subsets, and an itemset appearing in the majority of the subsets is determined to be a frequent itemset, called a popular frequent itemset. All popular frequent itemsets are compared with the frequent itemsets discovered directly from the entire database using the same frequency threshold. The recalls and precisions of the popular frequent itemsets from the selected subsets are analyzed against the entire database to show how many true frequent itemsets in the entire transaction dataset can be discovered from the selected subsets.

We have conducted experiments to evaluate the new approach on two datasets. Empirically, we have shown that the proposed method is not only capable of producing highly accurate frequent itemsets but also of approximating the global frequencies of frequent itemsets with very small error.

The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 describes the proposed approach. In Section 4, the details of the algorithm are described. The experimental evaluation is presented in Section 5. Finally, conclusions and future work are given in Section 6.

2. Related Work

Frequent itemset mining is a well-studied problem in computer science. However, the enormous growth of data has made traditional methods inadequate, so parallel and distributed algorithms came into use.

The researchers in [13] introduced a parallel implementation of the FP-Growth algorithm on the GPU. In [10] and [11], the authors introduced two different approaches for mining frequent itemsets in a large database based on MapReduce. In [10], two methods for frequent itemset mining based on the Eclat algorithm were presented. The first is a distributed version of Eclat that partitions the search space more evenly among different processing units; the second is a hybrid approach in which k-length frequent itemsets are mined by an Apriori variant and the found frequent itemsets are then distributed to the mappers, where frequent itemsets are mined using Eclat. The authors of [11] presented a novel zone-wise approach for frequent itemset mining based on sending computations to a multi-node cluster. All of these approaches achieve a speedup over the traditional algorithms and allow larger datasets to be mined. However, they all require the entire dataset to produce the result.

In [14], Riondato et al. introduced PARMA (Parallel Randomized Algorithm for Approximate association rule mining). The algorithm sends random subsets of the database as input to the machines in a cluster. Each machine then mines the subset it received, and reducers combine the results.

The research in [15] is the basis for the current work. It presented the random sample partition (RSP) data model and showed that block-level samples from an RSP data model can be used efficiently for data analysis.
3. A New Approach

In our approach, we split a big dataset into smaller disjoint subsets such that the distribution of frequent itemsets in each subset is similar to the distribution of frequent itemsets in the entire dataset. Mining smaller subsets allows traditional frequent itemset mining algorithms to be used without running into memory limit problems. By combining the results from random subsets, we are able to produce highly accurate approximate frequent itemsets.

3.1 Definitions

A transactional dataset D = {t1, t2, ..., tn} is a collection of n transactions, where each transaction t is a subset of the set of items I = {I1, I2, ..., Im}. An itemset A with k distinct items is referred to as a k-itemset. In this paper, we do not distinguish between itemsets with different numbers of unique items. Given an itemset A, define T_D(A) as the set of transactions in D that contain A. The number of transactions in T_D(A) is defined as the support of A in D and denoted as support(A) = |T_D(A)|. The frequency of A, i.e., the proportion of transactions in D containing A, is denoted as freq_D(A) = |T_D(A)| / n.

Under the above definitions, the task of finding frequent itemsets in D with respect to a minimum frequency threshold θ is defined as follows.

Definition 1. Given a minimum frequency threshold θ with 0 < θ ≤ 1, frequent itemset mining with respect to θ is finding all itemsets {Ai}, 1 ≤ i ≤ M, with freq_D(Ai) ≥ θ, where M is the total number of frequent itemsets found in D. Formally, we define the whole set of frequent itemsets in D as

FI(D, I, θ) = {(Ai, freq_D(Ai)) : Ai ⊆ I, freq_D(Ai) ≥ θ}.

Definition 2. Let FI(D, I, θ) be the set of frequent itemsets in D with respect to θ and M = |FI(D, I, θ)| the number of frequent itemsets in FI. The accumulative distribution of frequent itemsets in FI is defined as

P(f) = (1 / M) Σ_{Ai ∈ FI} I(freq_D(Ai) ≥ f),

where I(·) is an indicator function and f is a frequency value with θ ≤ f ≤ 1. An example of P(f) is shown in Fig. 1.

Fig. 1. Example of the accumulative frequent itemsets distribution

Let D be a big transactional dataset and P = {D1, D2, ..., Dk} a partition of D, where D = D1 ∪ D2 ∪ ... ∪ Dk and Di ∩ Dj = ∅ for i ≠ j. Di, 1 ≤ i ≤ k, is called a block of transactions of dataset D.

Definition 3. Let P_D(f) be the accumulative distribution of the frequent itemsets FI(D, I, θ) and P_Di(f) the accumulative distribution of the frequent itemsets FI(Di, I, θ), 1 ≤ i ≤ k. P is a random sample partition of D if

P_Di(f) → P_D(f) as |Di| → |D|.    (1)

Definition 3 restates the definition of a random sample partition in [15] with respect to frequent itemsets, replacing the condition E[F̃k(t)] = F(t) with condition (1), where F̃k(t) denotes the sample distribution function of Dk and E[F̃k(t)] denotes its expectation.

Definition 4. FI_D(D, I, θ) is called the set of global frequent itemsets and FI_Di(Di, I, θ) the set of local frequent itemsets. Accordingly, P_D(f) is the accumulative distribution of the global frequent itemsets and P_Di(f) is the accumulative distribution of the local frequent itemsets in Di.

3.2 Approximate Computing

When the transactional dataset D is big and cannot be held in memory, we cannot run a frequent itemset mining algorithm on D to find all frequent itemsets FI_D(D, I, θ). In this situation, we randomly select a set of l transaction blocks {D1, D2, ..., Dl} from the partition P and use the sets of local frequent itemsets FI_Di(Di, I, θ), 1 ≤ i ≤ l, to estimate the set of global frequent itemsets FI_D(D, I, θ). This approach is called approximate frequent itemset mining.
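To make these definitions concrete, the following minimal Python sketch computes itemset frequencies, enumerates frequent itemsets naively (Definition 1), and evaluates the accumulative distribution P(f) (Definition 2). The function names and the toy dataset are our own illustrations, not part of the paper's implementation.

from itertools import combinations

def freq(D, A):
    """Frequency of itemset A in transaction list D: |T_D(A)| / n."""
    A = set(A)
    return sum(1 for t in D if A <= set(t)) / len(D)

def frequent_itemsets(D, theta, max_len=3):
    """Naive enumeration of all itemsets with freq >= theta (Definition 1)."""
    items = sorted({i for t in D for i in t})
    FI = {}
    for k in range(1, max_len + 1):
        for A in combinations(items, k):
            f = freq(D, A)
            if f >= theta:
                FI[A] = f
    return FI

def accumulative_distribution(FI, f):
    """P(f): fraction of frequent itemsets with frequency >= f (Definition 2)."""
    return sum(1 for v in FI.values() if v >= f) / len(FI)

D = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
FI = frequent_itemsets(D, theta=0.5)       # singletons at 0.75, pairs at 0.5
print(accumulative_distribution(FI, 0.6))  # prints 0.5

The naive enumeration is exponential in the number of items and is meant only to pin down the definitions; the paper uses Apriori for the actual mining (Section 4.2).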
Definition 5. Let itemset A be a frequent itemset in FI_Di(Di, I, θ) for some 1 ≤ i ≤ l. A is called a popular frequent itemset if

Σ_{i=1}^{l} I(A ∈ FI_Di(Di, I, θ)) > α,    (2)

where I(·) is an indicator function and α is a given integer greater than l/2.

Definition 6. The frequency of a popular frequent itemset A is defined as the average of its local frequencies over the blocks in which A is frequent:

freq(A) = (1 / |S_A|) Σ_{i ∈ S_A} freq_Di(A), where S_A = {i : A ∈ FI_Di(Di, I, θ)}.

The set PFI of all popular frequent itemsets from FI_Di(Di, I, θ), 1 ≤ i ≤ l, is the estimation of the set of global frequent itemsets FI_D(D, I, θ). Given PFI, and assuming FI_D(D, I, θ) is known, an itemset A has one of the following statuses:

- true positive if A ∈ PFI and A ∈ FI_D(D, I, θ);
- false positive if A ∈ PFI and A ∉ FI_D(D, I, θ);
- true negative if A ∉ PFI and A ∉ FI_D(D, I, θ);
- false negative if A ∉ PFI and A ∈ FI_D(D, I, θ).

4. An Approximate Frequent Itemsets Finding Algorithm

In this section, we propose an approximate algorithm for finding the set of popular frequent itemsets from a set of l transaction blocks {D1, D2, ..., Dl} randomly selected from the partition of a big transactional dataset D, and for using the popular frequent itemsets to estimate the set of frequent itemsets in D with respect to a given frequency threshold θ. The algorithm consists of three steps: converting the dataset D into a partition of k transaction blocks and randomly selecting l blocks from the partition; finding the local frequent itemsets in each of the l selected transaction blocks; and finding the popular frequent itemsets from the local frequent itemsets.

4.1 Generating the Partition of Transaction Blocks

Given a transaction dataset D, the first step is to convert it into a partition of transaction blocks. D is preprocessed such that each record represents one purchase transaction, and transactions with only one purchased item are removed. The pseudo code for creating the random partition is given in Algorithm 1.

Algorithm 1 RSP block generation and selection
Input: D: preprocessed data; l: number of subsets; m: subset size
1: procedure RSPBlocks(D, l, m)
2:     k ← ⌊|D| / m⌋
3:     for each Di, 1 ≤ i ≤ k do
4:         randomly assign m transactions from D to the i-th block without replacement
5:     end for
6:     randomly select l transaction blocks from the k created blocks, l ≤ k
7:     Output: set of l transaction blocks of D
8: end procedure

4.2 Finding Local Frequent Itemsets

In this step, the Apriori algorithm is called to find the local frequent itemsets in each of the l transaction blocks {D1, D2, ..., Dl} with respect to a given minimum frequency threshold θ. Finally, l sets of local frequent itemsets are obtained. The pseudo code for obtaining the local frequent itemsets is presented in Algorithm 2.

Algorithm 2 Local frequent itemset mining
Input: {Di}: set of l transaction blocks of D; θ: minimum frequency threshold
1: procedure LocalFIs({Di}, θ)
2:     for each Di, 1 ≤ i ≤ l do
3:         FIi ← Apriori(Di, θ)
4:     end for
5:     Output: {FIi}: the l sets of local frequent itemsets
6: end procedure
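These two steps admit a direct implementation. The following Python sketch mirrors Algorithms 1 and 2, reusing the naive frequent_itemsets miner from the earlier sketch as a stand-in for Apriori; the function names and the seed parameter are our own.

import random

def rsp_blocks(D, l, m, seed=None):
    """Algorithm 1: shuffle D, cut it into k = |D| // m disjoint blocks of
    m transactions each, and return l randomly selected blocks (l <= k)."""
    rng = random.Random(seed)
    transactions = list(D)
    rng.shuffle(transactions)  # random assignment without replacement
    k = len(transactions) // m
    blocks = [transactions[i * m:(i + 1) * m] for i in range(k)]
    return rng.sample(blocks, l)

def local_fis(blocks, theta):
    """Algorithm 2: mine each block independently with the same threshold,
    returning one itemset-to-frequency mapping per block."""
    return [frequent_itemsets(Di, theta) for Di in blocks]

Shuffling and slicing is equivalent to drawing m transactions per block without replacement, and any in-memory frequent itemset miner can replace the naive one, since the blocks are small enough to fit in memory.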
4.3 Finding Popular Local Frequent Itemsets

The l sets of local frequent itemsets are united into one set of unique local frequent itemsets. For each frequent itemset in the united set, the number of its appearances in the l sets is checked against Eq. (2). If the condition is satisfied, the frequent itemset is considered a popular frequent itemset; otherwise, it is dropped. After all local frequent itemsets in the united set are checked, the set of popular frequent itemsets is obtained. These popular frequent itemsets are used as the approximate set of the frequent itemsets in D with respect to the same minimum frequency threshold θ. The pseudo code is given in Algorithm 3.

Algorithm 3 Popular frequent itemset mining
Input: {FIi}: the l sets of local frequent itemsets; α: voting threshold, α > l/2
1: procedure PopularFIs({FIi}, α)
2:     for every frequent itemset found in {FIi}, create a (key, value) pair, where the itemset is the key and the number of blocks in which it is frequent is the value
3:     for each frequent itemset FI do
4:         if value(FI) > α then
5:             include FI in the set of popular frequent itemsets
6:         end if
7:     end for
8:     Output: set of popular frequent itemsets
9: end procedure
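In Python, the voting step reduces to counting, per itemset, the number of blocks in which it is frequent. The sketch below treats each local result as an itemset-to-frequency mapping and estimates the frequency of a popular itemset as the mean of its local frequencies, which is our reading of Definition 6.

from collections import Counter, defaultdict

def popular_fis(local_sets, alpha):
    """Algorithm 3: keep the itemsets frequent in more than alpha of the
    l blocks; alpha is an integer greater than l/2 (Definition 5)."""
    votes = Counter()          # itemset -> number of blocks where it is frequent
    freqs = defaultdict(list)  # itemset -> its local frequencies
    for FIi in local_sets:
        for A, f in FIi.items():
            votes[A] += 1
            freqs[A].append(f)
    # frequency estimate: mean local frequency (our reading of Definition 6)
    return {A: sum(freqs[A]) / len(freqs[A])
            for A, n in votes.items() if n > alpha}

For instance, popular_fis(local_fis(rsp_blocks(D, l=30, m=10000), theta=0.005), alpha=16) chains the three sketches into the full pipeline for one of the parameter settings used in the experiments below.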
5. Experiments

To demonstrate the performance of the approximate frequent itemset mining algorithm, we conducted a series of experiments on two datasets. We ran our algorithm several times with different numbers of transaction blocks and different block sizes, and compared the set of popular frequent itemsets with the exact set of frequent itemsets obtained from the entire dataset.

5.1 Datasets

We evaluated the proposed approach on two datasets, downloaded from Kaggle.com and the Open-Source Data Mining Library. The properties of the datasets used in the experiments are described in Table 1.

Table 1. Properties of the datasets used in experiments

                             Kaggle dataset   Online Retail dataset
Number of transactions       729148           541908
Number of items              791              2603
Average transaction length   8                4

5.2 Experiment Settings

To test the proposed algorithm, the set of popular frequent itemsets was compared against the set of frequent itemsets in the entire dataset to compute the accuracy, recall, and precision. We conducted 50 experiments for each set of parameters specified in Table 2 and averaged the obtained results. For both datasets, we used the same minimum frequency threshold for both the local and the global frequent itemsets. We chose the threshold to be small enough to produce a large collection of output frequent itemsets and set θ to 0.005 for all experiments. Testing started with a comparison of the accumulative distributions of the local and global frequent itemsets and proceeded with the evaluation of different metrics of the popular frequent itemsets.

Table 2. Parameters used for experiments

Number of blocks   Block size
50                 10000, 5000, 3500, 2000, 1000, 500
30                 10000, 5000, 3500, 2000, 1000, 500
15                 10000, 5000, 3500, 2000, 1000, 500
10                 10000, 5000, 3500, 2000, 1000, 500
5                  10000, 5000, 3500, 2000, 1000, 500

5.3 Evaluation Methods

To evaluate the accuracy and sufficiency of the obtained approximate frequent itemsets, we used the confusion matrix. The confusion matrix allows a more detailed analysis of the efficiency of the proposed approach through three measures: recall, precision, and accuracy.

Recall = TP / (TP + FN) shows the fraction of the global frequent itemsets that are contained in the popular frequent itemsets. Precision = TP / (TP + FP) shows the fraction of the popular frequent itemsets that are contained in the set of the global frequent itemsets. Accuracy = (TP + TN) / (TP + TN + FP + FN) shows the proportion of correct results among the total number of cases examined.

5.4 Experiment Results

We started with a comparison of the accumulative distributions of the local and global frequent itemsets. Fig. 2(a) clearly shows that the accumulative distribution of the global frequent itemsets is similar to the accumulative distributions of the local frequent itemsets when the subset size is sufficiently large (left and middle graphs in Fig. 2(a)). Different colors show the differences between the accumulative distributions of the local frequent itemsets. However, decreasing the subset size increases the number of local frequent itemsets (right graph in Fig. 2(a)) and results in a significant difference between the accumulative distributions of the global and local frequent itemsets. Nevertheless, the accumulative distribution of the popular frequent itemsets is almost identical to that of the global frequent itemsets. The accumulative distributions of the global and popular frequent itemsets are represented in Fig. 2(b): the number of popular frequent itemsets almost matches the number of global frequent itemsets, and the overall frequency of the popular frequent itemsets is consistent with the frequency of the global frequent itemsets.

Fig. 2. Accumulative frequent itemsets distributions: (a) the global frequent itemsets (left) and the local frequent itemsets (middle and right); (b) the global frequent itemsets (left) and the popular frequent itemsets. Number of subsets = 30, subset size (middle) = 10000, subset size (right) = 2000 (Kaggle dataset)

The proposed algorithm approximates the exact set of frequent itemsets in the entire dataset. The difference between the approximate and global collections is the false-positive frequent itemsets. The number of false-positive frequent itemsets affects one of the accuracy measures, namely precision. Fig. 3 shows how the precision is affected by the different experimental parameters. It is observed that using more transaction blocks decreases the number of false-positive itemsets and therefore increases the precision. From the graph, we can also see that the number of false-positive frequent itemsets decreases as the block size grows.

Fig. 3. Precision changes with different parameters

The overall change of the accuracy, defined in terms of the confusion matrix, is represented in Fig. 4. From the graphs, we can see that an accuracy increase can be obtained by increasing both the number of subsets used and the subset size.

Fig. 4. Accuracy changes with different parameters

5.5 Result Analysis

To evaluate the efficiency of our approach, we conducted 50 independent experiments on both datasets with the parameters specified in Table 2 to estimate the performance for each set of parameters. For each test run, the set of approximate frequent itemsets was compared to the exact set of frequent itemsets obtained by mining the entire dataset. As a result, we obtained 50 observations of elapsed time, recall, precision, and accuracy for each set of parameters, and then averaged all values. Fig. 5 illustrates how the average accuracy changes with the amount of data being mined. The graphs show that the proposed algorithm is capable of producing approximate frequent itemsets with above 87% accuracy using only a little less than 10% of the data.

Fig. 5. Accuracy increases with the rise of data used to mine

We also evaluated the estimated frequency error of the approximate frequent itemsets for all experimental parameters. In Fig. 6, we depict the distribution of the average absolute error in the frequency estimation, defined as

err = (1 / |C|) Σ_{A ∈ C} |freq_PFI(A) - freq_D(A)|,

where C is the set of all itemsets A that are contained in both the approximate and the global frequent itemsets. We can see that the fluctuation of the error decreases as the subset size increases. The error is reduced as the number of transaction blocks and the block size increase.

Fig. 6. Error in frequency estimations for different parameters (Kaggle dataset)
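The metrics of Section 5.3 and the error measure above can be computed with a short function over the approximate and global results. In the sketch below, PFI and GFI map itemsets to frequencies; the universe of candidate itemsets used to count true negatives is an input we introduce for illustration, since the paper does not fix it explicitly.

def evaluate(PFI, GFI, universe):
    """Confusion-matrix metrics of the popular itemsets (PFI) against the
    global ones (GFI), plus the average absolute frequency error over the
    itemsets present in both."""
    tp = sum(1 for A in universe if A in PFI and A in GFI)
    fp = sum(1 for A in universe if A in PFI and A not in GFI)
    tn = sum(1 for A in universe if A not in PFI and A not in GFI)
    fn = sum(1 for A in universe if A not in PFI and A in GFI)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    common = [A for A in PFI if A in GFI]
    avg_err = sum(abs(PFI[A] - GFI[A]) for A in common) / len(common)
    return recall, precision, accuracy, avg_err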
6. Conclusions and Future Work

In this paper, we have presented a new approach for mining approximate collections of frequent itemsets based on a random sample partition of the data. We have shown that using the RSP data model for big data can be very beneficial, especially for the frequent itemset mining task, since the size of a transaction database grows much faster than the patterns contained in it change. In future work, we are going to implement a parallel version of the algorithm on a cluster and conduct experiments on big datasets at the terabyte scale. We will also conduct a theoretical analysis of the approach.

References

1. Agrawal, R., Imielinski, T., and Swami, A.: Mining association rules between sets of items in large databases. In Proceedings of SIGMOD, 1993.
2. Agrawal, R. and Srikant, R.: Fast algorithms for mining association rules in large databases. In Proceedings of VLDB, 1994.
3. Han, J., Pei, J., and Yin, Y.: Mining frequent patterns without candidate generation. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2000.
4. Grahne, G. and Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. In CEUR Workshop Proceedings, 2003.
5. Racz, B.: An FP-growth variation without rebuilding the FP-tree. In CEUR Workshop Proceedings, 2003.
6. Grahne, G. and Zhu, J.: Reducing the main memory consumptions of FPmax* and FPclose. In CEUR Workshop Proceedings, 2004.
7. Zaki, M.J., Parthasarathy, S., Ogihara, M., and Li, W.: New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997.
8. Zaki, M.J. and Gouda, K.: Fast vertical mining using diffsets. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining, 326–335, Washington, DC, USA, 2003.
9. Schmidt-Thieme, L.: Algorithmic features of Eclat. In Proceedings of the Workshop on Frequent Itemset Mining Implementations, 2004.
10. Moens, S., Aksehirli, E., and Goethals, B.: Frequent itemset mining for big data. In 2013 IEEE International Conference on Big Data, 2013.
11. Prajapati, D., Garg, S., and Chauhan, N.C.: Interesting association rule mining with consistent and inconsistent rule detection from big sales data in distributed environment. Future Computing and Informatics Journal 2(1), 19–30 (2017).
12. Dean, J. and Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008).
13. Jiang, H. and Meng, H.: A parallel FP-growth algorithm based on GPU. In 2017 IEEE 14th International Conference on e-Business Engineering, 97–102 (2017).
14. Riondato, M., DeBrabant, J.A., Fonseca, R., and Upfal, E.: PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. In Proceedings of the ACM International Conference on Information and Knowledge Management, 2012.
15. Salloum, S., He, Y., Huang, J.Z., Zhang, X., and Emara, T.Z.: A random sample partition data model for big data. arXiv:1712.04146, https://arxiv.org/abs/1712.04146, 2017.