<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing Transactional Databases for Frequent Itemset Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Lezcano</string-name>
          <email>clezcano@cs.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Marta Arias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Universitat Politecnica de Catalunya</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a study of the characteristics of transactional databases used in frequent itemset mining. Such characterizations have typically been used to benchmark and understand the data mining algorithms working on these databases. The aim of our study is to give a picture of how diverse and representative these benchmarking databases are, both in general but also in the context of particular empirical studies found in the literature. Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones. Our study shows that our list of metrics is able to capture much of the datasets' inner complexity and thus provides a good basis for the characterization of transactional datasets. Finally, we provide a set of representative datasets based on our characterization that may be used as a benchmark safely.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction</title>
      <p>
        Since the introduction of Frequent Itemset Mining (FIM) and its early algorithms, a huge number of algorithms have been proposed [<xref ref-type="bibr" rid="ref11 ref15 ref18 ref19 ref7">1, 14, 18, 22, 23</xref>] (see Fournier-Viger et al. [<xref ref-type="bibr" rid="ref6">9</xref>] for a more detailed review of the latest FIM approaches); in fact, itemset mining and the closely related association rule mining have arguably been the hottest topic within the field of data mining for years. With the appearance of competing itemset mining algorithms comes the need to understand their strengths and weaknesses. Natural questions arise: which algorithm is the fastest for one particular dataset (we use dataset and database interchangeably)? What is the best algorithm? Or, more realistically: which algorithms work best for which types of datasets?
      </p>
      <p>In an attempt to answer these questions, authors set up and run empirical studies (what we call algorithm benchmarking here). In data mining algorithm benchmarking, one uses a set of datasets as the basis for comparison (a benchmark), and applies all competing algorithm candidates to the benchmark in order to establish which algorithm is the fastest, uses less memory, or gives more accurate results. In order to make the comparison fair, one should use as many datasets as possible, and these should be as diverse as possible.</p>
      <p>In other words, the benchmark should be a
representative sample of what is to be expected when applying the
algorithms in real-world scenarios.</p>
      <p>
        The question of whether a benchmark (that is, a set of databases used for comparing algorithms) is representative or not is a difficult one to answer. Our proposed approach in this work is to characterize each dataset with the computation of several metrics, each capturing a different aspect of the datasets' complexity and structure. Trivial metrics are, for example, the number of transactions in the database, and more complex ones are those establishing, for example, their density [<xref ref-type="bibr" rid="ref9">12</xref>]. Once one has a description of each dataset by means of a vector of metrics, one can see how the different metrics' values are distributed. If they cover a wide range collectively, then the benchmark is a good candidate for being representative. If, on the other hand, values are all clumped together, then the benchmark is probably not a good one to use.
      </p>
      <p>Another use for the characterization of databases is to establish connections between database characteristics and the performance of particular data mining algorithms, so that one could, for example, make claims such as "algorithm A works well for databases that have many transactions and are dense, but algorithm B works better if the database is sparse and small."</p>
      <p>
        In fact, there are several works that do precisely
this, namely, establishing connections between dataset
characteristics and algorithm performance [
        <xref ref-type="bibr" rid="ref20">11, 24</xref>
        ]. We
detail these works in the next section.
      </p>
      <p>An important tool that we exploit in this paper is the FIMI public repository, where FIM algorithms and benchmark datasets were made publicly accessible to the FIM community. Since its introduction by Goethals and Zaki [11], this repository serves as a standard set of tools for benchmarking algorithmic approaches. It has been very successful in the sense that most benchmarking studies use datasets from this repository.</p>
      <p>Another public repository of transactional datasets is the SPMF open-source data mining library.</p>
      <p>This repository is constantly updated and provides a greater number of FIM implementations and datasets [<xref ref-type="bibr" rid="ref5">8</xref>]. We believe that it is paramount to characterize the transactional databases present in both repositories. For that reason, we use metrics found in the literature and others of our own in order to understand their nature and evaluate their representativeness.</p>
      <p>Another area where characterizing databases may be of use is in synthetic database generation. When attempting to generate synthetic data, one needs some control over the generated data. This control may come in the shape of characterizing the data generated, for example, by means of metrics similar to the ones we propose here.</p>
      <p>
        Finally, we believe that taking into account the characterization of databases when making claims about an algorithm's performance over competitors is important in order to give enough support to such claims. We recognize the fact that this may be out of the scope of some work where the focus is on algorithm development. To make the life of such developers easier, we propose a benchmark that covers the full spectrum of values found in our study as a "minimum representative benchmark" (MRB), so that authors that use our proposed MRB have some guarantees that the databases used are representative. Naturally, if in the future new metrics emerge, the list may have to be revised, so we propose this as an evolving MRB. To summarize, the contributions in this paper are:

        1. We provide a comprehensive list of metrics used for database characterization
        2. We evaluate the metrics mentioned in the item above over publicly available and commonly used databases used for frequent itemset mining algorithms
        3. We define and study benchmark representativeness of existing works
        4. We propose a minimum representative benchmark, i.e., a benchmark whose representativeness is guaranteed according to the metrics included here
      </p>
      <p>The organization of this paper is as follows. Section 2 describes the related work, which is divided into three parts. Section 3 presents the definition of the metrics used in the characterization of transactional datasets. Then, in Section 4 the experiments and evaluations carried out in this work are presented. Finally, Section 5 explains the conclusions and future work.</p>
      <p>
        2 Related work
      </p>
      <p>On database characterization. A central property of transactional databases is their density. The best intuitive notion of dataset density is given by Gouda and Zaki [<xref ref-type="bibr" rid="ref9">12</xref>], who explain that having long frequent itemsets at high levels of support is what makes a dataset dense. This is why frequent itemset mining algorithms typically adapt their tactics according to this feature, since it is more difficult to work through a dense dataset than a sparse one.</p>
      <p>Among the works proposing new metrics for the evaluation of transactional database properties, Gouda and Zaki [<xref ref-type="bibr" rid="ref9">12</xref>] is, to the best of our knowledge, the first one to introduce a classification metric for dataset density, which is based mainly on the positive border length distribution. Essentially, the authors study the shape of the positive border distribution cut off at specific levels of support and classify datasets into four distinct types. Even though this work sheds some light on the characterization of datasets, it does not give any notion of how the proposed classification behaves over different levels of support.</p>
      <p>On the connections between database characteristics and algorithm performance. To the best of our knowledge, the work by Zheng et al. [<xref ref-type="bibr" rid="ref20">24</xref>] was the first one to note the relevance of data characteristics to an algorithm's performance. The authors show that algorithms that appear better suited for some specific dataset properties do not respond in the same way for other classes of datasets. Specifically, they report that algorithms measured over synthetic datasets behave very differently when run over real datasets. The same authors explain that the reason for this outcome might be that the algorithms were fine-tuned for the synthetic dataset characteristics used in their experiment, which caused them to perform poorly on the real-life datasets. Since then there has been awareness of the importance of benchmarking methods over databases with different sets of dataset characteristics. Goethals and Zaki [11] address the problem raised by Zheng et al. [<xref ref-type="bibr" rid="ref20">24</xref>] and introduce the FIMI repository with much success.</p>
      <p>On algorithm benchmarking. The work of Zimmermann [<xref ref-type="bibr" rid="ref21">25</xref>] points out several existing issues in the evaluation of frequent itemset mining algorithms. Here, the author denounces the lack of the necessary diversity of characteristics to fully understand the algorithms' strengths and limitations. The author notes that it is not the number of datasets utilized in measuring the quality of an algorithm that matters, but evaluating an algorithm with datasets with a variety of characteristics representing real-world scenarios. On the other hand, it is also noted by the aforementioned author that simply adding more datasets to a repository does not entail that more different characteristics are being added. Briefly, the problem above motivates the use of a collection of real datasets or artificial dataset generators that emulate the whole spectrum of genuine characteristics found in real-life databases. In the same way, it is explained that every dataset utilized in benchmarking should be comprised of characteristics which emerge from a real-life process. This means that merely generating new datasets under randomly selected characteristics cannot be considered appropriate, since these datasets do not reflect real-life behavior. Our work is clearly motivated by the criticisms made in this work.</p>
      <p>Other works that use positive borders for the characterization of databases are Ramesh et al. [<xref ref-type="bibr" rid="ref16 ref17">19, 20</xref>]. Instead of using them to expand our understanding of dataset properties, they focus on the problem of synthetic database generation. For this, multiple levels of positive border distributions are extracted from an original dataset and transferred to a synthetic one afterwards.</p>
      <p>Palmerini et al. [<xref ref-type="bibr" rid="ref13">16</xref>] propose new measures to learn a dataset's density based on the information-theoretic concept of entropy and the average support of frequent itemsets. This work uses entropy to take a glance at the database density without going through an inspection of the entire database, which is convenient in helping algorithms to decide the best action to take at runtime.</p>
      <p>Flouvat et al. [<xref ref-type="bibr" rid="ref3 ref4">6, 7</xref>] introduce a dataset classification which follows the idea of positive borders of Gouda and Zaki [<xref ref-type="bibr" rid="ref9">12</xref>] described above. The difference is that Flouvat et al. [<xref ref-type="bibr" rid="ref4">7</xref>] classify dataset density considering the negative border length distribution along with that of positive borders. The authors find that negative borders provide extra information which allows a finer understanding of datasets. Concretely, an algorithm's performance is affected by how separated the mean distance between both borders is.</p>
      <p>All of the metrics mentioned in these related works are included in our list of proposed metrics, as well as some others. The next section details the different metrics and related concepts considered in our study.</p>
      <p>
        3 Definition of metrics
      </p>
      <p>We propose to carry out different measurements on the FIMI and SPMF benchmark datasets utilizing metrics found in the literature and new ones of our own.</p>
      <p>We start by defining basic properties of a transactional database such as dataset size (DS), which is simply its number of transactions; average transaction size (ATS), which, as the name suggests, is the average of all transaction sizes; and maximum transaction size (MTS), which is the maximum size over all transactions.</p>
      <p>Most of the following metrics are based on FIM. Let I be a finite set of different elements called items and let |I| represent its size. In other words, I represents the database's alphabet, and hereafter we call |I| the alphabet size (AS). Any subset of I is denoted as an itemset X. In particular, an itemset of size k is regarded as a k-itemset. Let D be a transactional database of size |D| in which each transaction is represented by an itemset. Duplication of transactions may exist; thus, transactions are differentiated via identifiers.</p>
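      <p>To make these elemental measures concrete, the following sketch computes DS, ATS, MTS, and AS on a toy database (the function name and the toy data are ours, not the paper's):</p>
      <p>
```python
def basic_properties(db):
    """db: a list of transactions, each an iterable of items."""
    sizes = [len(t) for t in db]
    return {
        "DS": len(db),                # dataset size: number of transactions
        "ATS": sum(sizes) / len(db),  # average transaction size
        "MTS": max(sizes),            # maximum transaction size
        "AS": len(set().union(*db)),  # alphabet size |I|
    }

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
props = basic_properties(db)  # {"DS": 4, "ATS": 2.75, "MTS": 4, "AS": 4}
```
      </p>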
      <p>The support of an itemset, sup(X), is defined as the cardinality of the set of transactions in D of which X is a subset. X is considered frequent if its support is greater than or equal to a minimum support minsup defined by the user, i.e., sup(X) ≥ minsup. This allows us to define FI(minsup), or simply FI, as the set of all frequent itemsets with support greater than or equal to minsup.</p>
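      <p>As an illustration, sup(X) and FI(minsup) can be computed by brute-force enumeration; this sketch is only adequate for tiny alphabets, and the helper names are ours:</p>
      <p>
```python
from itertools import combinations

def support(db, x):
    """sup(X): number of transactions of which X is a subset."""
    return sum(1 for t in db if x.issubset(t))

def frequent_itemsets(db, minsup):
    """FI(minsup): every itemset with support at or above minsup."""
    items = sorted(set().union(*db))
    fi = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if support(db, x) >= minsup:
                fi.add(x)
    return fi

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
fi = frequent_itemsets(db, 2)  # 9 itemsets, e.g. {a}, {a, c}, {a, b, c}
```
      </p>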
      <p>
        Durand and Quafafou [<xref ref-type="bibr" rid="ref2">5</xref>] present an intuitive view of frequent itemset borders, where the positive border is the set of its maximal frequent itemsets (Equation 3.1) and the negative border is the set of minimal infrequent itemsets (Equation 3.2).

        (3.1) Bd+(FI) = {X ∈ FI | ∀Y ⊃ X, Y ∉ FI}
        (3.2) Bd−(FI) = {X ∈ 2^I \ FI | ∀Y ⊂ X, Y ∈ FI}
      </p>
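      <p>Given a downward-closed FI, the two borders follow directly from Equations 3.1 and 3.2. The sketch below assumes FI is already available (for instance from a brute-force enumeration) and uses our own helper names:</p>
      <p>
```python
from itertools import combinations

def positive_border(fi):
    """Bd+(FI): frequent itemsets with no frequent proper superset (Eq. 3.1)."""
    return {x for x in fi if not any(x != y and x.issubset(y) for y in fi)}

def negative_border(fi, items):
    """Bd-(FI): infrequent itemsets whose proper subsets are all frequent (Eq. 3.2)."""
    nb = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            x = frozenset(combo)
            if x in fi:
                continue
            if k == 1 or all(frozenset(c) in fi for c in combinations(combo, k - 1)):
                nb.add(x)
    return nb

# FI at minsup 2 for a toy database, written as strings for brevity
fi = {frozenset(s) for s in ("a", "b", "c", "d", "ab", "ac", "bc", "bd", "abc")}
pos = positive_border(fi)          # {abc, bd}: maximal frequent itemsets
neg = negative_border(fi, "abcd")  # {ad, cd}: minimal infrequent itemsets
```
      </p>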
      <p>Palmerini et al. [<xref ref-type="bibr" rid="ref13">16</xref>] propose three metrics. The first one, fraction of 1s (F1), represents the dataset D as a binary matrix D′ where every row corresponds to a transaction of D and every column to an item X ∈ I. An element of D′ is set to 1 if the corresponding item appears in the transaction, and to 0 otherwise. The fraction of 1s is calculated as the ratio of the number of 1s in D′ to its number of elements, i.e., F1 = (number of 1s) / (|D| · |I|).</p>
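      <p>F1 needs no explicit matrix: summing the distinct items per transaction counts the 1s of D′ directly (the function name and toy data are ours):</p>
      <p>
```python
def fraction_of_ones(db, alphabet_size):
    """F1: number of 1s in the binary matrix D' over its |D| * |I| cells."""
    ones = sum(len(set(t)) for t in db)  # one 1 per distinct item per transaction
    return ones / (len(db) * alphabet_size)

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
f1 = fraction_of_ones(db, 4)  # 11 ones in a 4 x 4 matrix: 0.6875
```
      </p>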
      <p>Maximum singleton support (MSS) [<xref ref-type="bibr" rid="ref17">20</xref>] is the maximum over all singleton supports, i.e., MSS = max_{X ∈ I} sup(X). This metric provides an upper bound on the support of any itemset and therefore constitutes an important parameter. It also guides us in choosing appropriate thresholds of minimum support values. That is, S is the sequence of minsup values given by the user in an itemset mining operation. Namely, S = ⟨s1, s2, s3, …, sm⟩ where 0 &lt; sk &lt; sk+1 and sm ≤ MSS.</p>
      <p>The second one, FI average support, as its name suggests, takes the result of a mining process and measures the distance between the average support of all FI itemsets and minsup. We take this idea; however, we here consider average support distance (ASD) as the area under the curve formed by the points {(s, s̄) | ∀s ∈ S}, where s̄ = (1 / |FI(s)|) Σ_{X ∈ FI(s)} sup(X). Hence, it is assumed that a database is denser the greater its ASD value is.</p>
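      <p>The paper does not spell out how the "area under the curve" is evaluated over the discrete point set; the sketch below assumes the trapezoidal rule and uses made-up mean-support values, so it shows the pattern rather than the authors' exact computation. The same helper serves MCD, PBC, NBC, FAL, PBL, and NBL by swapping in the appropriate y-values.</p>
      <p>
```python
def area_under_curve(points):
    """Trapezoidal area under (x, y) points sorted by x (our assumption)."""
    pts = sorted(points)
    area = 0.0
    for i in range(1, len(pts)):
        x0, y0 = pts[i - 1]
        x1, y1 = pts[i]
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative values only: mean support of FI(s) at each level s in S.
mean_support = {0.1: 0.35, 0.2: 0.45, 0.3: 0.60}
asd_value = area_under_curve(mean_support.items())
```
      </p>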
      <p>Maximum cardinality difference (MCD) is defined to approximate how the number of frequent itemsets is distributed throughout the range of possible supports S defined above. That is, we define MCD as the area under the curve defined by the set of ordered pairs {(s, |FI(s)|) | ∀s ∈ S}. After normalization, a low MCD value indicates that the greater percentage of frequent itemsets concentrates at the lowest levels of support, whereas a higher value suggests the number of frequent itemsets does not vary abruptly throughout the different support levels in S.</p>
      <p>Similar to MCD, positive border cardinality (PBC) is the area under the curve given by the set of ordered pairs {(s, |Bd+(FI(s))|) | ∀s ∈ S}. Here, we are interested in knowing how the cardinality of the positive border set (Equation 3.1) changes over the different levels of support.</p>
      <p>Figure 1: Scatter plot of our 21 datasets using metrics ASD (x axis) and PBC (y axis). The four clusters found by K-Means are color-marked.</p>
      <p>Likewise, negative border cardinality (NBC) is based on the same formulation as PBC, but instead it utilizes the negative border set (Equation 3.2) to define the set of points {(s, |Bd−(FI(s))|) | ∀s ∈ S}.</p>
      <p>We define FI average length (FAL) using a procedure similar to ASD. However, instead of considering the average support of all FI itemsets, here FAL computes the area under the curve using the average length of all FI itemsets for every minsup defined by the user. For this, we define the set of points as {(s, l̄) | ∀s ∈ S}, where l̄ = (1 / |FI(s)|) Σ_{X ∈ FI(s)} |X|. Recall that the sequence of minsup values S was defined above.</p>
      <p>In addition, positive border average length (PBL) and negative border average length (NBL) are defined with the same formulation as the previous FAL metric. In this case, PBL and NBL utilize the sets Bd+(FI(s)) and Bd−(FI(s)), respectively, instead of FI(s). It is worth mentioning that in this work all the metrics which base their calculation on itemset lengths, such as FAL, PBL, and NBL, are not normalized.</p>
      <p>Gaifman graph density (GGD) is another metric we use to approximate the dataset density. Here, an undirected graph G = (V, E) called the Gaifman graph is built from a transactional database D. Its vertices are items (i.e. V = I), and two items share an edge if they appear together in at least one transaction. Then, we compute the density of the Gaifman graph G as the ratio of the cardinality of its set of edges |E| to the maximum number of edges, that is, GGD = 2|E| / (|I| (|I| − 1)).</p>
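      <p>The Gaifman graph and its density reduce to a few lines; the toy database is ours:</p>
      <p>
```python
from itertools import combinations

def gaifman_graph_density(db):
    """GGD: co-occurrence edges over the maximum |I|*(|I|-1)/2 possible edges."""
    items = set().union(*db)
    edges = set()
    for t in db:
        edges.update(combinations(sorted(t), 2))  # items co-occurring in t
    n = len(items)
    return len(edges) / (n * (n - 1) / 2)

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]
ggd = gaifman_graph_density(db)  # 4 of the 6 possible edges are present
```
      </p>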
      <p>For their third metric, Palmerini et al. [<xref ref-type="bibr" rid="ref13">16</xref>] use the concept of entropy over probability distributions of k-itemsets in order to attempt a density estimation of the database without needing to work through it entirely. This is done by first computing H1, which is the entropy of the 1-itemset set (i.e. the singleton set), then H2 considering the 2-itemset set, and so forth.</p>
      <p>In general they define Hk = −Σ p_i log2(p_i), where the sum ranges over the set of k-itemsets and p_i corresponds to the relative support of a given k-itemset. Datasets with low entropies are considered denser.</p>
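      <p>A sketch of Hk under our reading that p_i is a k-itemset's occurrence count normalized over all k-itemset occurrences (the paper only says "relative support", so this normalization is an assumption):</p>
      <p>
```python
from itertools import combinations
from math import log2

def entropy_k(db, k):
    """Hk: entropy of the empirical distribution of k-itemset occurrences in D."""
    counts = {}
    for t in db:
        for combo in combinations(sorted(t), k):
            counts[combo] = counts.get(combo, 0) + 1
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

db = [{"a"}, {"a"}, {"b"}, {"b"}]
h1 = entropy_k(db, 1)  # two equally likely singletons: entropy of 1 bit
```
      </p>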
      <p>We want to point out that most metrics have been described and used in the past; however, we have defined new ones (MCD, GGD) and adapted existing ones to our needs (ASD, FAL, PBL, PBC, NBC, NBL).</p>
      <p>
        4 Dataset characterization
      </p>
      <p>In this section we present the characterization of the datasets under study and introduce our notion of minimum representative benchmark. Finally, we include a description of how the FIM literature has employed these datasets for algorithm benchmarking.</p>
      <p>We consider two public repositories: FIMI2 and SPMF3; these two repositories have been made available to the community for the purpose of benchmarking new and existing FIM algorithms. Both repositories possess collections of real-life datasets (listed in Table 1, from row 3 to row 21). The first two datasets (i.e. rows 1 and 2) are also real datasets taken from W. Hamalainen of the University of Eastern Finland4. The first four columns of Table 1 present elemental properties of any transactional database: the number of transactions (DS), alphabet size (AS), average transaction size (ATS), and the maximum transaction size (MTS).</p>
      <p>2 http://fimi.ua.ac.be (accessed September 1, 2017)
        3 http://www.philippe-fournier-viger.com/spmf/ (accessed September 1, 2017)
        4 http://www.cs.uef.fi/~whamalai/datasets.html (accessed September 1, 2017)</p>
      <p>Metrics F1 and GGD of Table 1 and MSS of Table 2 are presented as percentages. Metrics H1 and H2 of Table 1 have been calculated based on the formulation of entropy H using the 1-itemset and 2-itemset sets, respectively. The five metrics of Table 2 (MCD, ASD, FAL, PBC, and PBL) are computed at different levels of support. These levels of support are presented in the same table for each dataset and are named by the notation S defined previously in Section 3.</p>
      <p>The formulation of the metrics related to negative borders, NBC and NBL, has been presented in this work; however, the experimental results on these metrics will be included in future work.</p>
      <p>Two observations are important. Firstly, each of the support levels in S is given as a percentage, denoted as the ratio of the number of transactions to the size of the database. Secondly, the highest level of support used with a particular dataset is bounded by the dataset's MSS value. Given this upper bound, we have taken (roughly) equidistant intervals of support.</p>
      <p>Next, we show in Table 1 and Table 2 the values obtained for each elemental metric and also for the more sophisticated metrics described in Section 3 for our 21 datasets.</p>
      <p>4.1 Minimum representative benchmark. Intuitively, a representative benchmark is a set of datasets one can safely do empirical studies on (for example, checking whether one algorithm is better than another in a benchmarking study). As an example, suppose that we only care about the number of transactions of a dataset. Then, a benchmark would be representative if it were to include datasets with few transactions as well as datasets with a high number of them, and perhaps datasets with an intermediate number of transactions. In a sense, we are clustering the datasets according to this particular value, and a representative benchmark is one that includes at least one representative from each cluster. We follow this idea, but instead of using a single characteristic, we use a whole array of them, namely, all the metrics considered in Section 3.</p>
      <p>Therefore, our working assumption is that if the values of the benchmark cover the whole range of possible values, then it is representative. Moreover, we seek minimum benchmarks in the sense that we want to include as few datasets as possible: adding more datasets with similar characteristics to the benchmark does not enrich it and slows down the benchmarking process.</p>
      <p>We have clustered the 21 datasets used in this paper into four clusters using K-Means. In order to perform the clustering, we have used the scikit-learn package [<xref ref-type="bibr" rid="ref14">17</xref>], more concretely its kmeans++ initialization version with 500 restarts. Metrics have been scaled prior to clustering with RobustScaler from scikit-learn to avoid different ranges of values perverting the clustering process. Finally, we have chosen to partition into four clusters because this resulted in the most succinct description of the resulting clusters (please see the paragraph below on the tree description of the clusters regarding Figure 2).</p>
      <p>Figure 1 shows the four clusters found, which are: {4, 8, 10, 11, 14, 21} (cluster 0), {12, 16, 17, 20} (cluster 1), {15} (cluster 2), and finally {1, 2, 3, 5, 6, 7, 9, 13, 18, 19} (cluster 3). The underlying idea is that datasets within a cluster are similar to each other (in terms of their metrics' values), but different from those in other clusters. Any representative benchmark should therefore include at least one dataset from each of the four clusters found (a representative benchmark constitutes a hitting set for the four clusters).</p>
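      <p>Checking whether a candidate benchmark is a hitting set of the clusters is straightforward; the cluster contents are those of Figure 1, while the function name and the candidate benchmarks are ours:</p>
      <p>
```python
def is_representative(benchmark, clusters):
    """True if the benchmark contains at least one dataset from every cluster."""
    return all(any(d in cluster for d in benchmark) for cluster in clusters)

clusters = [
    {4, 8, 10, 11, 14, 21},             # cluster 0
    {12, 16, 17, 20},                   # cluster 1
    {15},                               # cluster 2
    {1, 2, 3, 5, 6, 7, 9, 13, 18, 19},  # cluster 3
]
ok = is_representative({4, 12, 15, 1}, clusters)  # hits every cluster
bad = is_representative({4, 12, 1}, clusters)     # misses cluster 2
```
      </p>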
      <p>In order to understand the nature of the four clusters found by K-Means, we have generated a decision tree that is able to classify all datasets into their right cluster using the metrics' values as attributes. This tree can be seen in Figure 2. Cluster 0, for example, is formed by datasets having a value of at most 2006 in the ASD metric. Cluster 1 consists of those datasets having a large value in ASD and a value of PBC between 1945 and 6416. Cluster 3 consists of datasets having large values in ASD but small values of PBC. Finally, cluster 2 consists of datasets having large values for both ASD and PBC.</p>
    </sec>
    <sec id="sec-2">
      <title>4.2 Literature review of empirical studies</title>
      <p>
        Table 3 presents the real-life benchmarking datasets used by several authors to perform comparison studies on FIM algorithms. Here, Goethals and Zaki [11] and Uno et al. [<xref ref-type="bibr" rid="ref18">22</xref>] based their benchmarking studies on most of the datasets presented in this work, closely followed by Uno et al. [21] and Burdick et al. [<xref ref-type="bibr" rid="ref8">2</xref>]. This information allows us to identify which works need to expand their pool of benchmarking datasets in order to consider their outcome as closer to reality, in addition to giving the FIM community a global vision of the way the set of public benchmarking datasets is distributed among the different authors.
      </p>
      <p>In order to find out which of the sets employed in the empirical studies of Table 3 are representative, we need to establish which of the benchmarks employed constitute hitting sets of the four groups defined previously in Section 4.1.</p>
      <p>The datasets used in Goethals and Zaki [11] and Uno et al. [<xref ref-type="bibr" rid="ref18">22</xref>] are {3, 5, 6, 7, 8, 9, 10, 11, 13, 16}. This set intersects with all the clusters with the exception of cluster 2 (the recordLink dataset); thus they miss the characteristic metrics provided by this cluster. So, both studies would only need to include recordLink to comply with the requirement. The remaining studies mentioned in Table 3 miss datasets from more than one cluster.</p>
      </p>
      <p>Hence, according to the analysis and criteria used in this work, we conclude that there is no study that uses a fully representative benchmark, a finding that the various authors may take into account when deciding the set of benchmarking datasets to be used in their experiments. Besides, this work serves as a guide and motivation for new benchmarking datasets to be introduced into the public domain in order to be further characterized and to extend the range of characteristics covered.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and future work</title>
      <p>In this work we provide a study of metrics for the characterization of transactional databases commonly used for the benchmarking of itemset mining algorithms. We argue that these metrics are needed in order to assess the diversity and, therefore, the representativeness of such benchmarks, so that conclusions drawn from benchmarking studies are sufficiently supported. We study the representativeness of databases used as the basis for existing benchmarking studies found in the literature based on our characteristics, thus assessing which of the studies has better support for their claims. Finally, we propose a way of obtaining benchmarks with guaranteed diversity that authors of new benchmarking studies can benefit from.</p>
      <p>As future lines of research we are always on the
lookout for new metrics used for the characterization
of databases. As new metrics and databases are
incorporated into our lists, we should revise the minimum
representative benchmarks accordingly. This could be
thought of as an evolving process and as such could (and
should) potentially be included in the FIMI or similar
repository to be used by the data mining community.</p>
      <p>Finally, a line of research that we are pursuing is that of synthetic database generation. The existence of this paper is, in fact, due to the fact that we identified the need for characterizing databases so that we could assess in some way the datasets that we generate. With characterization metrics, we can check the nature of the databases that we generate (generally mimicking some original database that, due to privacy and ownership constraints, cannot be shown or used). For example, we could check whether the synthetic and original databases are similar enough, if that is what is desired. Or we could study how parameter tuning in the generative algorithms affects the nature of the generated databases. In any case, being able to characterize databases is a powerful and necessary tool to understand the nature of the generated datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>We thank the anonymous reviewers for their helpful
comments. Both authors have been partially supported
by TIN2017-89244-R from MINECO (Spain's
Ministerio de Economia, Industria y Competitividad) and
the recognition 2017SGR-856 (MACDA) from AGAUR
(Generalitat de Catalunya). Christian Lezcano is
supported by Paraguay's Foreign Postgraduate Scholarship
Programme Don Carlos Antonio Lopez (BECAL).
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[4] ZhiHong Deng, ZhongHui Wang, and JiaJian Jiang. A new algorithm for fast mining frequent itemsets using N-lists. Science China Information Sciences, 55(9):2008-2030, Sep 2012. ISSN 1869-1919.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Durand</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Quafafou</surname>
          </string-name>
          .
<article-title>Frequent Itemset Border Approximation by Dualization</article-title>
.
<source>Transactions on Large-Scale Data- and Knowledge-Centered Systems (TLDKS)</source>
,
<volume>26</volume>
:
<fpage>32</fpage>
-
<lpage>60</lpage>
, March
<year>2016</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Flouvat</surname>
          </string-name>
          ,
<string-name>
  <given-names>Fabien</given-names>
  <surname>De Marchi</surname>
</string-name>
, and
<string-name>
  <given-names>Jean-Marc</given-names>
  <surname>Petit</surname>
</string-name>
          .
          <article-title>A thorough experimental study of datasets for frequent itemsets</article-title>
          .
<source>In Data Mining, Fifth IEEE International Conference on</source>
, 8 pp.
<source>IEEE</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Flouvat</surname>
          </string-name>
          , Fabien De Marchi, and
          <string-name>
<given-names>Jean-Marc</given-names>
            <surname>Petit</surname>
          </string-name>
          .
<article-title>A new classification of datasets for frequent itemsets</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
):
<fpage>1</fpage>
-
<lpage>19</lpage>
, Feb
<year>2010</year>
. ISSN 1573-7675.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Fournier-Viger</surname>
          </string-name>
          ,
<string-name>
  <given-names>Jerry Chun-Wei</given-names>
  <surname>Lin</surname>
</string-name>
, Antonio Gomariz, Ted Gueniche, Azadeh Soltani, Zhihong Deng, and Hoang Thanh Lam.
<article-title>The SPMF open-source data mining library version 2</article-title>
          . In Bettina Berendt, Bjorn Bringmann, Elisa Fromont, Gemma Garriga, Pauli Miettinen, Nikolaj Tatti, and Volker Tresp, editors,
          <source>Machine Learning and Knowledge Discovery in Databases</source>
, pages
<fpage>36</fpage>
-
<lpage>40</lpage>
, Cham
          ,
          <year>2016</year>
          . Springer International Publishing.
          <source>ISBN 978-3-319-46131-1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Fournier-Viger</surname>
          </string-name>
          ,
<string-name>
  <given-names>Jerry Chun-Wei</given-names>
  <surname>Lin</surname>
</string-name>
          , Bay Vo, Tin Chi Truong, Ji Zhang, and Hoai Bac Le.
          <article-title>A survey of itemset mining</article-title>
          .
<source>Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
<ref id="ref7">
  <mixed-citation>
    [1]
    <string-name>
      <given-names>Rakesh</given-names>
      <surname>Agrawal</surname>
    </string-name>
    and
    <string-name>
      <given-names>Ramakrishnan</given-names>
      <surname>Srikant</surname>
    </string-name>
    .
    <article-title>Fast algorithms for mining association rules in large databases</article-title>
    .
    <source>In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94</source>
    , pages
    <fpage>487</fpage>
    -
    <lpage>499</lpage>
    , San Francisco, CA, USA,
    <year>1994</year>
    . Morgan Kaufmann Publishers Inc. ISBN 1-55860-153-8.
  </mixed-citation>
</ref>
<ref id="ref7b">
  <mixed-citation>
    [10]
    <string-name>
      <given-names>Bart</given-names>
      <surname>Goethals</surname>
    </string-name>
    .
    <article-title>Survey on frequent pattern mining</article-title>
    . Technical report, University of Helsinki,
    <year>2003</year>
    .
  </mixed-citation>
</ref>
<ref id="ref7c">
  <mixed-citation>
    [11]
    <string-name>
      <given-names>Bart</given-names>
      <surname>Goethals</surname>
    </string-name>
    and
    <string-name>
      <given-names>Mohammed Javeed</given-names>
      <surname>Zaki</surname>
    </string-name>
    .
    <article-title>Advances in frequent itemset mining implementations: Introduction to FIMI'03</article-title>
    .
    <source>In FIMI '03, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA</source>
    ,
    <year>2003</year>
    . URL http://ceur-ws.org/Vol-90/intro.pdf.
  </mixed-citation>
</ref>
<ref id="ref8">
  <mixed-citation>
    [2]
    <string-name>
      <given-names>Douglas</given-names>
      <surname>Burdick</surname>
    </string-name>
    , Manuel Calimlim, Jason Flannick,
    <string-name>
      <given-names>Johannes</given-names>
      <surname>Gehrke</surname>
    </string-name>
    , and
    <string-name>
      <given-names>Tomi</given-names>
      <surname>Yiu</surname>
    </string-name>
    .
    <article-title>MAFIA: A performance study of mining maximal frequent itemsets</article-title>
    .
    <source>In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations</source>
    ,
    <year>2003</year>
    .
  </mixed-citation>
</ref>
<ref id="ref9">
  <mixed-citation>
    [12]
    <string-name>
      <given-names>Karam</given-names>
      <surname>Gouda</surname>
    </string-name>
    and
    <string-name>
      <given-names>Mohammed Javeed</given-names>
      <surname>Zaki</surname>
    </string-name>
    .
    <article-title>Efficiently mining maximal frequent itemsets</article-title>
    .
    <source>In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM '01</source>
    , pages
    <fpage>163</fpage>
    -
    <lpage>170</lpage>
    , Washington, DC, USA,
    <year>2001</year>
    . IEEE Computer Society. ISBN 0-7695-1119-8.
  </mixed-citation>
</ref>
<ref id="ref9b">
  <mixed-citation>
    [21]
    <string-name>
      <given-names>Takeaki</given-names>
      <surname>Uno</surname>
    </string-name>
    , Tatsuya Asai, Yuzo Uchida, and
    <string-name>
      <given-names>Hiroki</given-names>
      <surname>Arimura</surname>
    </string-name>
    .
    <article-title>LCM: An efficient algorithm for enumerating frequent closed item sets</article-title>
    .
    <source>In FIMI</source>
    ,
    <year>2003</year>
    .
  </mixed-citation>
</ref>
      <ref id="ref10">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Fast algorithms for frequent itemset mining using FP-trees</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>17</volume>
          (
          <issue>10</issue>
):
<fpage>1347</fpage>
-
<lpage>1362</lpage>
, Oct
<year>2005</year>
. ISSN 1041-4347.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14]
<string-name>
  <given-names>Jiawei</given-names>
  <surname>Han</surname>
</string-name>
          , Jian Pei, Yiwen Yin, and
          <string-name>
            <given-names>Runying</given-names>
            <surname>Mao</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation: A frequent-pattern tree approach</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
):
<fpage>53</fpage>
-
<lpage>87</lpage>
, Jan
<year>2004</year>
. ISSN 1573-756X.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15]
<string-name>
  <given-names>Guimei</given-names>
  <surname>Liu</surname>
</string-name>
, Hongjun Lu, and Jeffrey Xu Yu.
<article-title>AFOPT: An efficient implementation of pattern growth approach</article-title>
.
<source>In Proceedings of the ICDM workshop</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Palmerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          .
          <article-title>Statistical properties of transactional databases</article-title>
          .
          <source>In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04</source>
          , pages
          <fpage>515</fpage>
-
<lpage>519</lpage>
          , New York, NY, USA,
          <year>2004</year>
          .
          <source>ACM. ISBN 1-58113-812-1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
-
<lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
<string-name>
  <given-names>Jian</given-names>
  <surname>Pei</surname>
</string-name>
          , Jiawei Han, Hongjun Lu, Shojiro Nishio,
          <string-name>
            <given-names>Shiwei</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dongqing</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>H-mine: hyperstructure mining of frequent patterns in large databases</article-title>
          .
          <source>In Proceedings 2001 IEEE International Conference on Data Mining</source>
          , pages
<fpage>441</fpage>
-
<lpage>448</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Ganesh</given-names>
            <surname>Ramesh</surname>
          </string-name>
, William A. Maniatty, and
<string-name>
  <given-names>Mohammed J.</given-names>
  <surname>Zaki</surname>
</string-name>
          .
          <article-title>Feasible itemset distributions in data mining: Theory and application</article-title>
          .
          <source>In Proceedings of the Twenty-second ACM SIGMODSIGACT-SIGART Symposium on Principles of Database Systems, PODS '03</source>
          , pages
          <fpage>284</fpage>
-
<lpage>295</lpage>
          , New York, NY, USA,
          <year>2003</year>
          .
          <source>ACM. ISBN 1-58113-670-6.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [20]
<string-name>
  <given-names>Ganesh</given-names>
  <surname>Ramesh</surname>
</string-name>
          ,
          <string-name>
            <given-names>Mohammed J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          , and
<string-name>
  <given-names>William A.</given-names>
  <surname>Maniatty</surname>
</string-name>
          .
          <article-title>Distribution-based synthetic database generation techniques for itemset mining</article-title>
          .
          <source>In Proceedings of the 9th International Database Engineering &amp; Application Symposium</source>
, IDEAS '05, pages
<fpage>307</fpage>
-
<lpage>316</lpage>
, Washington, DC, USA,
<year>2005</year>
. IEEE Computer Society. ISBN 0-7695-2404-4.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
<string-name>
  <given-names>Takeaki</given-names>
  <surname>Uno</surname>
</string-name>
          , Masashi Kiyomi, and
          <string-name>
            <given-names>Hiroki</given-names>
            <surname>Arimura</surname>
          </string-name>
          .
<article-title>LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets</article-title>
          .
          <source>In FIMI</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [23]
<string-name>
  <given-names>Mohammed J.</given-names>
  <surname>Zaki</surname>
</string-name>
          .
          <article-title>Scalable algorithms for association mining</article-title>
          .
          <source>IEEE Trans. on Knowl. and Data Eng</source>
          .,
          <volume>12</volume>
          (
          <issue>3</issue>
):
<fpage>372</fpage>
-
<lpage>390</lpage>
, May
<year>2000</year>
. ISSN 1041-4347. doi: 10.1109/69.846291.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [24]
<string-name>
  <given-names>Zijian</given-names>
  <surname>Zheng</surname>
</string-name>
          , Ron Kohavi, and
          <string-name>
            <given-names>Llew</given-names>
            <surname>Mason</surname>
          </string-name>
          .
          <article-title>Real world performance of association rule algorithms</article-title>
          .
          <source>In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01</source>
          , pages
          <fpage>401</fpage>
-
<lpage>406</lpage>
          , New York, NY, USA,
          <year>2001</year>
          . ACM. ISBN 1-58113-391-X.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Albrecht</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          .
          <article-title>The data problem in data mining</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
<fpage>38</fpage>
-
<lpage>45</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>