<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing Transactional Databases for Frequent Itemset Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Lezcano</string-name>
          <email>clezcano@cs.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Marta Arias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Universitat Politecnica de Catalunya</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a study of the characteristics of transactional databases used in frequent itemset mining. Such characterizations have typically been used to benchmark and understand the data mining algorithms working on these databases. The aim of our study is to give a picture of how diverse and representative these benchmarking databases are, both in general but also in the context of particular empirical studies found in the literature. Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones. Our study shows that our list of metrics is able to capture much of the datasets' inner complexity and thus provides a good basis for the characterization of transactional datasets. Finally, we provide a set of representative datasets based on our characterization that may be used as a benchmark safely.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction</title>
      <p>
        Since the introduction of Frequent Itemset Mining (FIM) and its early algorithms, a huge number of algorithms have been proposed [<xref ref-type="bibr" rid="ref11 ref15 ref18 ref19 ref7">1, 14, 18, 22, 23</xref>] (see Fournier-Viger et al. [<xref ref-type="bibr" rid="ref6">9</xref>] for a more detailed review of the latest FIM approaches); in fact, itemset mining and the closely related association rule mining have arguably been the hottest topic within the field of data mining for years. With the appearance of competing itemset mining algorithms comes the need to understand their strengths and weaknesses. Natural questions arise: which algorithm is the fastest for one particular dataset (we use dataset and database interchangeably)? What is the best algorithm? Or, more realistically: which algorithms work best for which types of datasets?
      </p>
      <p>In an attempt to answer these questions, authors set up and run empirical studies (what we call algorithm benchmarking here). In data mining algorithm benchmarking, one uses a set of datasets as the basis for comparison (a benchmark), and applies all competing algorithm candidates to the benchmark in order to establish which algorithm is the fastest, uses less memory, or gives more accurate results. In order to make the comparison fair, one should use as many datasets as possible, and these should be as diverse as possible.</p>
      <p>In other words, the benchmark should be a
representative sample of what is to be expected when applying the
algorithms in real-world scenarios.</p>
      <p>
        The question of whether a benchmark (that is, a set of databases used for comparing algorithms) is representative or not is a difficult one to answer. Our proposed approach in this work is to characterize each dataset with the computation of several metrics, each capturing a different aspect of the datasets' complexity and structure. Trivial metrics are, for example, the number of transactions in the database, and more complex ones are those establishing, for example, their density [<xref ref-type="bibr" rid="ref9">12</xref>]. Once one has a description of each dataset by means of a vector of metrics, one can see how the different metrics' values are distributed. If they cover a wide range collectively, then the benchmark is a good candidate for being representative. If, on the other hand, values are all clumped together, then the benchmark is probably not a good one to use.
      </p>
      <p>Another use for the characterization of databases is to establish connections between database characteristics and the performance of particular data mining algorithms, so that one could, for example, make claims such as "algorithm A works well for databases that have many transactions and are dense, but algorithm B works better if the database is sparse and small."</p>
      <p>
        In fact, there are several works that do precisely
this, namely, establishing connections between dataset
characteristics and algorithm performance [
        <xref ref-type="bibr" rid="ref20">11, 24</xref>
        ]. We
detail these works in the next section.
      </p>
      <p>An important tool that we exploit in this paper is the FIMI public repository, where FIM algorithms and benchmark datasets were made publicly accessible to the FIM community. Since its introduction by Goethals and Zaki [11], this repository serves as a standard set of tools for benchmarking algorithmic approaches. It has been very successful in the sense that most benchmarking studies use datasets from this repository.</p>
      <p>Another public repository of transactional datasets is the SPMF open-source data mining library.</p>
      <p>This repository is constantly updated and provides a greater number of FIM implementations and datasets [<xref ref-type="bibr" rid="ref5">8</xref>]. We believe that it is paramount to characterize the transactional databases present in both repositories. For that reason, we use metrics found in the literature and others of our own in order to understand their nature and evaluate their representativeness.</p>
      <p>Another area where characterizing databases may be of use is in synthetic database generation. When attempting to generate synthetic data, one needs some control over the generated data. This control may come in the shape of characterizing the data generated, for example, by means of metrics similar to the ones we propose here.</p>
      <p>
        Finally, we believe that taking into account the characterization of databases when making claims about an algorithm's performance over competitors is important in order to give enough support to such claims. We recognize the fact that this may be out of the scope of some work where the focus is on algorithm development. To make the life of such developers easier, we propose a benchmark that covers the full spectrum of values found in our study as a "minimum representative benchmark" (MRB), so that authors that use our proposed MRB have some guarantees that the databases used are representative. Naturally, if in the future new metrics emerge, the list may have to be revised, so we propose this as an evolving MRB. To summarize, the contributions in this paper are:

        1. We provide a comprehensive list of metrics used for database characterization
        2. We evaluate the metrics mentioned in the item above over publicly available and commonly used databases used for frequent itemset mining algorithms
        3. We define and study benchmark representativeness of existing works
        4. We propose a minimum representative benchmark, i.e., a benchmark whose representativeness is guaranteed according to the metrics included here
      </p>
      <p>The organization of this paper is as follows. Section 2 describes the related work, which is divided into three parts. Section 3 presents the definition of the metrics used in the characterization of transactional datasets. Then, in Section 4 the experiments and evaluations carried out in this work are presented. Finally, Section 5 explains the conclusions and future work.</p>
      <p>
        2 Related work
      </p>
      <p>On database characterization. A central property of transactional databases is their density. The best intuitive notion of dataset density is given by Gouda and Zaki [<xref ref-type="bibr" rid="ref9">12</xref>], who explain that having long frequent itemsets at high levels of support is what makes a dataset dense. This is why frequent itemset mining algorithms typically adapt their tactics according to this feature, since it is more difficult to work through a dense dataset than a sparse one.</p>
      <p>Among the works proposing new metrics for the evaluation of transactional database properties, Gouda and Zaki [<xref ref-type="bibr" rid="ref9">12</xref>] is, to the best of our knowledge, the first one to introduce a classification metric for dataset density, which is based mainly on the positive border length distribution. Essentially, the authors study the shape of the positive border distribution cut off at specific levels of support and classify datasets into four distinct types. Even though this work sheds some light on the characterization of datasets, it does not give any notion of how the proposed classification behaves over different levels of support.</p>
      <p>On the connections between database characteristics and algorithm performance. To the best of our knowledge, the work by Zheng et al. [<xref ref-type="bibr" rid="ref20">24</xref>] was the first one to note the relevance of data characteristics to an algorithm's performance. The authors show that algorithms that appear better suited for some specific dataset properties do not respond in the same way for other classes of datasets. Specifically, they report that algorithms measured over synthetic datasets behave very differently when run over real datasets. The same authors explain that the reason for this outcome might be that the algorithms were fine-tuned for the synthetic dataset characteristics used in their experiment, which caused them to perform poorly on the real-life datasets. Since then there has been awareness of the importance of benchmarking methods over databases with different sets of dataset characteristics. Goethals and Zaki [11] address the problem raised by Zheng et al. [<xref ref-type="bibr" rid="ref20">24</xref>] and introduce the FIMI repository with much success.</p>
      <p>On algorithm benchmarking. The work of Zimmermann [<xref ref-type="bibr" rid="ref21">25</xref>] points out several existing issues in the evaluation of frequent itemset mining algorithms. Here, the author denounces the lack of the necessary diversity of characteristics to fully understand the algorithms' strengths and limitations. The author notes that it is not the number of datasets utilized in measuring the quality of an algorithm that matters, but evaluating an algorithm with datasets with a variety of characteristics representing real-world scenarios. On the other hand, it is also noted by the aforementioned author that simply adding more datasets to a repository does not entail that more different characteristics are being added. Briefly, the problem above motivates the use of a collection of real datasets or artificial dataset generators that emulate the whole spectrum of genuine characteristics found in real-life databases. In the same way, it is explained that every dataset utilized in benchmarking should be comprised of characteristics which emerge from a real-life process. This means that merely generating new datasets under randomly selected characteristics cannot be considered appropriate, since these datasets do not reflect real-life behavior. Our work is clearly motivated by the criticisms made in this work.</p>
      <p>Other works that use positive borders for the characterization of databases are Ramesh et al. [<xref ref-type="bibr" rid="ref16 ref17">19, 20</xref>]. Instead of using them to expand our understanding of dataset properties, they focus on the problem of synthetic database generation. For this, multiple levels of positive border distributions are extracted from an original dataset and transferred to a synthetic one afterwards.</p>
      <p>Palmerini et al. [<xref ref-type="bibr" rid="ref13">16</xref>] propose new measures to learn a dataset's density based on the information-theoretic concept of entropy and the average support of frequent itemsets. This work uses entropy to take a glance at the database density without going through an inspection of the entire database, which is convenient in helping algorithms to decide the best action to take at runtime.</p>
      <p>Flouvat et al. [<xref ref-type="bibr" rid="ref3 ref4">6, 7</xref>] introduce a dataset classification which follows the idea of positive borders of Gouda and Zaki [<xref ref-type="bibr" rid="ref9">12</xref>] described above. The difference is that Flouvat et al. [<xref ref-type="bibr" rid="ref4">7</xref>] classify dataset density considering the negative border length distribution along with that of positive borders. The authors find that negative borders provide extra information which allows a finer understanding of datasets. Concretely, an algorithm's performance is affected by how separated the mean distance between both borders is.</p>
      <p>All of the metrics mentioned in these related works are included in our list of proposed metrics, as well as some others. The next section details the different metrics and related concepts considered in our study.</p>
      <p>
        3 Definition of metrics
      </p>
      <p>We propose to carry out different measurements on the FIMI and SPMF benchmark datasets utilizing metrics found in the literature and new ones of our own.</p>
      <p>We start by defining basic properties of a transactional database such as dataset size (DS), which is simply its number of transactions; average transaction size (ATS), which, as the name suggests, is the average of all transaction sizes; and maximum transaction size (MTS), which is the maximum size over all transactions.</p>
      <p>Most of the following metrics are based on FIM. Let I be a finite set of different elements called items and let |I| represent its size. In other words, I represents the database's alphabet, and hereafter we call |I| the alphabet size (AS). Any subset of I is denoted as an itemset X. In particular, an itemset of size k is regarded as a k-itemset. Let D be a transactional database of size |D| in which each transaction is represented by an itemset. Duplication of transactions may exist; thus, transactions are differentiated via identifiers.</p>
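      <p>To make these elemental measures concrete, the following sketch computes DS, ATS, MTS, and AS on a toy database (the function name and the toy data are ours, not the paper's):</p>
      <p>
```python
def basic_properties(db):
    """db: a list of transactions, each an iterable of items."""
    sizes = [len(t) for t in db]
    return {
        "DS": len(db),                # dataset size: number of transactions
        "ATS": sum(sizes) / len(db),  # average transaction size
        "MTS": max(sizes),            # maximum transaction size
        "AS": len(set().union(*db)),  # alphabet size |I|
    }

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
props = basic_properties(db)  # {"DS": 4, "ATS": 2.75, "MTS": 4, "AS": 4}
```
      </p>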
      <p>The support of an itemset, sup(X), is defined as the cardinality of the set of transactions in D of which X is a subset. X is considered frequent if its support is greater than or equal to a minimum support minsup defined by the user, i.e., sup(X) ≥ minsup. This allows us to define FI(minsup), or simply FI, as the set of all frequent itemsets with support greater than or equal to minsup.</p>
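      <p>As an illustration, sup(X) and FI(minsup) can be computed by brute-force enumeration; this sketch is only adequate for tiny alphabets, and the helper names are ours:</p>
      <p>
```python
from itertools import combinations

def support(db, x):
    """sup(X): number of transactions of which X is a subset."""
    return sum(1 for t in db if x.issubset(t))

def frequent_itemsets(db, minsup):
    """FI(minsup): every itemset with support at or above minsup."""
    items = sorted(set().union(*db))
    fi = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if support(db, x) >= minsup:
                fi.add(x)
    return fi

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
fi = frequent_itemsets(db, 2)  # 9 itemsets, e.g. {a}, {a, c}, {a, b, c}
```
      </p>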
      <p>
        Durand and Quafafou [<xref ref-type="bibr" rid="ref2">5</xref>] present an intuitive view of frequent itemset borders, where the positive border is the set of its maximal frequent itemsets (Equation 3.1) and the negative border is the set of minimal infrequent itemsets (Equation 3.2).

        (3.1) Bd+(FI) = {X ∈ FI | ∀Y ⊃ X, Y ∉ FI}
        (3.2) Bd−(FI) = {X ∈ 2^I \ FI | ∀Y ⊂ X, Y ∈ FI}
      </p>
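      <p>Given a downward-closed FI, the two borders follow directly from Equations 3.1 and 3.2. The sketch below assumes FI is already available (for instance from a brute-force enumeration) and uses our own helper names:</p>
      <p>
```python
from itertools import combinations

def positive_border(fi):
    """Bd+(FI): frequent itemsets with no frequent proper superset (Eq. 3.1)."""
    return {x for x in fi if not any(x != y and x.issubset(y) for y in fi)}

def negative_border(fi, items):
    """Bd-(FI): infrequent itemsets whose proper subsets are all frequent (Eq. 3.2)."""
    nb = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            x = frozenset(combo)
            if x in fi:
                continue
            if k == 1 or all(frozenset(c) in fi for c in combinations(combo, k - 1)):
                nb.add(x)
    return nb

# FI at minsup 2 for a toy database, written as strings for brevity
fi = {frozenset(s) for s in ("a", "b", "c", "d", "ab", "ac", "bc", "bd", "abc")}
pos = positive_border(fi)          # {abc, bd}: maximal frequent itemsets
neg = negative_border(fi, "abcd")  # {ad, cd}: minimal infrequent itemsets
```
      </p>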
      <p>Palmerini et al. [<xref ref-type="bibr" rid="ref13">16</xref>] propose three metrics. The first one, fraction of 1s (F1), represents the dataset D as a binary matrix D′ where every row corresponds to a transaction of D and every column to an item X ∈ I. An element of D′ is set to 1 if the corresponding item appears in the transaction, and to 0 otherwise. The fraction of 1s is calculated as the ratio of the number of 1s in D′ to its number of elements, i.e., F1 = (number of 1s) / (|D| · |I|).</p>
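      <p>F1 needs no explicit matrix: summing the distinct items per transaction counts the 1s of D′ directly (the function name and toy data are ours):</p>
      <p>
```python
def fraction_of_ones(db, alphabet_size):
    """F1: number of 1s in the binary matrix D' over its |D| * |I| cells."""
    ones = sum(len(set(t)) for t in db)  # one 1 per distinct item per transaction
    return ones / (len(db) * alphabet_size)

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
f1 = fraction_of_ones(db, 4)  # 11 ones in a 4 x 4 matrix: 0.6875
```
      </p>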
      <p>Maximum singleton support (MSS) [<xref ref-type="bibr" rid="ref17">20</xref>] is the maximum over all singleton supports, i.e., MSS = max_{X ∈ I} sup(X). This metric provides an upper bound on the support of any itemset and therefore constitutes an important parameter. It also guides us in choosing appropriate thresholds of minimum support values. That is, S is the sequence of minsup values given by the user in an itemset mining operation. Namely, S = ⟨s1, s2, s3, …, sm⟩ where 0 &lt; sk &lt; sk+1 and sm ≤ MSS.</p>
      <p>The second one, FI average support, as its name suggests, takes the result of a mining process and measures the distance between the average support of all FI itemsets and minsup. We take this idea; however, we here consider average support distance (ASD) as the area under the curve formed by the points {(s, s̄) | ∀s ∈ S}, where s̄ = (1 / |FI(s)|) Σ_{X ∈ FI(s)} sup(X). Hence, it is assumed that a database is denser the greater its ASD value is.</p>
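      <p>The paper does not spell out how the "area under the curve" is evaluated over the discrete point set; the sketch below assumes the trapezoidal rule and uses made-up mean-support values, so it shows the pattern rather than the authors' exact computation. The same helper serves MCD, PBC, NBC, FAL, PBL, and NBL by swapping in the appropriate y-values.</p>
      <p>
```python
def area_under_curve(points):
    """Trapezoidal area under (x, y) points sorted by x (our assumption)."""
    pts = sorted(points)
    area = 0.0
    for i in range(1, len(pts)):
        x0, y0 = pts[i - 1]
        x1, y1 = pts[i]
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative values only: mean support of FI(s) at each level s in S.
mean_support = {0.1: 0.35, 0.2: 0.45, 0.3: 0.60}
asd_value = area_under_curve(mean_support.items())
```
      </p>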
      <p>Maximum cardinality difference (MCD) is defined to approximate how the number of frequent itemsets is distributed throughout the range of possible supports S defined above. That is, we define MCD as the area under the curve defined by the set of ordered pairs {(s, |FI(s)|) | ∀s ∈ S}. After normalization, a low MCD value indicates that the greater percentage of frequent itemsets concentrates at the lowest levels of support, whereas a higher value suggests the number of frequent itemsets does not vary abruptly throughout the different support levels in S.</p>
      <p>Similar to MCD, positive border cardinality (PBC) is the area under the curve given by the set of ordered pairs {(s, |Bd+(FI(s))|) | ∀s ∈ S}. Here, we are interested in knowing how the cardinality of the positive border set (Equation 3.1) changes over the different levels of support.</p>
      <p>Figure 1: Scatter plot of our 21 datasets using metrics ASD (x axis) and PBC (y axis). The four clusters found by K-Means are color-marked.</p>
      <p>Likewise, negative border cardinality (NBC) is based on the same formulation as PBC, but instead it utilizes the negative border set (Equation 3.2) to define the set of points {(s, |Bd−(FI(s))|) | ∀s ∈ S}.</p>
      <p>We define FI average length (FAL) using a procedure similar to ASD. However, instead of considering the average support of all FI itemsets, here FAL computes the area under the curve using the average length of all FI itemsets for every minsup defined by the user. For this, we define the set of points as {(s, l̄) | ∀s ∈ S}, where l̄ = (1 / |FI(s)|) Σ_{X ∈ FI(s)} |X|. Recall that the sequence of minsup values S was defined above.</p>
      <p>In addition, positive border average length (PBL) and negative border average length (NBL) are defined with the same formulation as the previous FAL metric. In this case, PBL and NBL utilize the sets Bd+(FI(s)) and Bd−(FI(s)), respectively, instead of FI(s). It is worth mentioning that in this work all the metrics which base their calculation on itemset lengths, such as FAL, PBL, and NBL, are not normalized.</p>
      <p>Gaifman graph density (GGD) is another metric we use to approximate the dataset density. Here, an undirected graph G = (V, E) called the Gaifman graph is built from a transactional database D. Its vertices are items (i.e. V = I), and two items share an edge if they appear together in at least one transaction. Then, we compute the density of the Gaifman graph G as the ratio of the cardinality of its set of edges |E| to the maximum number of edges, that is, GGD = 2|E| / (|I| (|I| − 1)).</p>
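      <p>The Gaifman graph and its density reduce to a few lines; the toy database is ours:</p>
      <p>
```python
from itertools import combinations

def gaifman_graph_density(db):
    """GGD: co-occurrence edges over the maximum |I|*(|I|-1)/2 possible edges."""
    items = set().union(*db)
    edges = set()
    for t in db:
        edges.update(combinations(sorted(t), 2))  # items co-occurring in t
    n = len(items)
    return len(edges) / (n * (n - 1) / 2)

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]
ggd = gaifman_graph_density(db)  # 4 of the 6 possible edges are present
```
      </p>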
      <p>For their third metric, Palmerini et al. [<xref ref-type="bibr" rid="ref13">16</xref>] use the concept of entropy over probability distributions of k-itemsets in order to attempt a density estimation of the database without needing to work through it entirely. This is done by first computing H1, which is the entropy of the 1-itemset set (i.e. the singleton set), then H2 considering the 2-itemset set, and so forth.</p>
      <p>In general they define Hk = −Σ p_i log2(p_i), where the sum ranges over the set of k-itemsets and p_i corresponds to the relative support of a given k-itemset. Datasets with low entropies are considered denser.</p>
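      <p>A sketch of Hk under our reading that p_i is a k-itemset's occurrence count normalized over all k-itemset occurrences (the paper only says "relative support", so this normalization is an assumption):</p>
      <p>
```python
from itertools import combinations
from math import log2

def entropy_k(db, k):
    """Hk: entropy of the empirical distribution of k-itemset occurrences in D."""
    counts = {}
    for t in db:
        for combo in combinations(sorted(t), k):
            counts[combo] = counts.get(combo, 0) + 1
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

db = [{"a"}, {"a"}, {"b"}, {"b"}]
h1 = entropy_k(db, 1)  # two equally likely singletons: entropy of 1 bit
```
      </p>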
      <p>We want to point out that most metrics have been described and used in the past; however, we have defined new ones (MCD, GGD) and adapted existing ones to our needs (ASD, FAL, PBL, PBC, NBC, NBL).</p>
      <p>
        4 Dataset characterization
      </p>
      <p>In this section we present the characterization of the datasets under study and introduce our notion of minimum representative benchmark. Finally, we include a description of how the FIM literature has employed these datasets for algorithm benchmarking.</p>
      <p>We consider two public repositories: FIMI2 and SPMF3; these two repositories have been made available to the community for the purpose of benchmarking new and existing FIM algorithms. Both repositories possess collections of real-life datasets (listed in Table 1, from row 3 to row 21). The first two datasets (i.e. rows 1 and 2) are also real datasets taken from W. Hamalainen of the University of Eastern Finland4. The first four columns of Table 1 present elemental properties of any transactional database: the number of transactions (DS), alphabet size (AS), average transaction size (ATS), and the maximum transaction size (MTS).</p>
      <p>2 http://fimi.ua.ac.be (accessed September 1, 2017)
        3 http://www.philippe-fournier-viger.com/spmf/ (accessed September 1, 2017)
        4 http://www.cs.uef.fi/~whamalai/datasets.html (accessed September 1, 2017)</p>
      <p>Metrics F1 and GGD of Table 1 and MSS of Table 2 are presented as percentages. Metrics H1 and H2 of Table 1 have been calculated based on the formulation of entropy H using the 1-itemset and 2-itemset sets, respectively. The five metrics of Table 2 (MCD, ASD, FAL, PBC, and PBL) are computed at different levels of support. These levels of support are presented in the same table for each dataset and are named by the notation S defined previously in Section 3.</p>
      <p>The formulation of the metrics related to negative borders, NBC and NBL, has been presented in this work; however, the experimental results on these metrics will be included in future work.</p>
      <p>Two observations are important. Firstly, each of the support levels in S is given as a percentage, denoted as the ratio of the number of transactions to the size of the database. Secondly, the highest level of support used with a particular dataset is bounded by the dataset's MSS value. Given this upper bound, we have taken (roughly) equidistant intervals of support.</p>
      <p>Next, we show in Table 1 and Table 2 the values obtained for each elemental metric and also for the more sophisticated metrics described in Section 3 for our 21 datasets.</p>
      <p>4.1 Minimum representative benchmark. Intuitively, a representative benchmark is a set of datasets one can safely do empirical studies on (for example, checking whether one algorithm is better than another in a benchmarking study). As an example, suppose that we only care about the number of transactions of a dataset. Then, a benchmark would be representative if it were to include datasets with few transactions as well as datasets with a high number of them, and perhaps datasets with an intermediate number of transactions. In a sense, we are clustering the datasets according to this particular value, and a representative benchmark is one that includes at least one representative from each cluster. We follow this idea, but instead of using a single characteristic, we use a whole array of them, namely, all the metrics considered in Section 3.</p>
      <p>Therefore, our working assumption is that if the values of the benchmark cover the whole range of possible values, then it is representative. Moreover, we seek minimum benchmarks in the sense that we want to include as few datasets as possible: adding more datasets with similar characteristics to the benchmark does not enrich it and slows down the benchmarking process.</p>
      <p>We have clustered the 21 datasets used in this paper into four clusters using K-Means. In order to perform the clustering, we have used the scikit-learn package [<xref ref-type="bibr" rid="ref14">17</xref>], more concretely its kmeans++ initialization version with 500 restarts. Metrics have been scaled prior to clustering with RobustScaler from scikit-learn to avoid different ranges of values perverting the clustering process. Finally, we have chosen to partition into four clusters because this resulted in the most succinct description of the resulting clusters (please see the paragraph below on the tree description of the clusters regarding Figure 2).</p>
      <p>Figure 1 shows the four clusters found, which are: {4, 8, 10, 11, 14, 21} (cluster 0), {12, 16, 17, 20} (cluster 1), {15} (cluster 2), and finally {1, 2, 3, 5, 6, 7, 9, 13, 18, 19} (cluster 3). The underlying idea is that datasets within a cluster are similar to each other (in terms of their metrics' values), but different from those in other clusters. Any representative benchmark should therefore include at least one dataset from each of the four clusters found (a representative benchmark constitutes a hitting set for the four clusters).</p>
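      <p>Checking whether a candidate benchmark is a hitting set of the clusters is straightforward; the cluster contents are those of Figure 1, while the function name and the candidate benchmarks are ours:</p>
      <p>
```python
def is_representative(benchmark, clusters):
    """True if the benchmark contains at least one dataset from every cluster."""
    return all(any(d in cluster for d in benchmark) for cluster in clusters)

clusters = [
    {4, 8, 10, 11, 14, 21},             # cluster 0
    {12, 16, 17, 20},                   # cluster 1
    {15},                               # cluster 2
    {1, 2, 3, 5, 6, 7, 9, 13, 18, 19},  # cluster 3
]
ok = is_representative({4, 12, 15, 1}, clusters)  # hits every cluster
bad = is_representative({4, 12, 1}, clusters)     # misses cluster 2
```
      </p>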
      <p>In order to understand the nature of the four clusters found by K-Means, we have generated a decision tree that is able to classify all datasets into their right cluster using the metrics' values as attributes. This tree can be seen in Figure 2. Cluster 0, for example, is formed by datasets having a value of at most 2006 in the ASD metric. Cluster 1 consists of those datasets having a large value in ASD and a value of PBC between 1945 and 6416. Cluster 3 consists of datasets having large values in ASD but small values of PBC. Finally, cluster 2 consists of datasets having large values for both ASD and PBC.</p>
    </sec>
    <sec id="sec-2">
      <title>4.2 Literature review of empirical studies</title>
      <p>
        Table 3 presents the real-life benchmarking datasets used by several authors to perform comparison studies on FIM algorithms. Here, Goethals and Zaki [11] and Uno et al. [<xref ref-type="bibr" rid="ref18">22</xref>] based their benchmarking studies on most of the datasets presented in this work, closely followed by Uno et al. [21] and Burdick et al. [<xref ref-type="bibr" rid="ref8">2</xref>]. This information allows us to identify which works need to expand their pool of benchmarking datasets in order to consider their outcome as closer to reality, in addition to giving the FIM community a global vision of the way the set of public benchmarking datasets is distributed among the different authors.
      </p>
      <p>In order to find out which of the sets employed in the empirical studies of Table 3 are representative, we need to establish which of the benchmarks employed constitute hitting sets of the four groups defined previously in Section 4.1.</p>
      <p>The datasets used in Goethals and Zaki [11] and Uno et al. [<xref ref-type="bibr" rid="ref18">22</xref>] are {3, 5, 6, 7, 8, 9, 10, 11, 13, 16}. This set intersects with all the clusters with the exception of cluster 2 (the recordLink dataset); thus they miss the characteristic metrics provided by this cluster. So, both studies would only need to include recordLink to comply with the requirement. The remaining studies mentioned in Table 3 miss datasets from more than one cluster.</p>
      </p>
      <p>Hence, according to the analysis and criteria used in this work, we conclude that there is no study that uses a fully representative benchmark, a finding that the various authors may take into account when deciding the set of benchmarking datasets to be used in their experiments. Besides, this work serves as a guide and motivation for new benchmarking datasets to be introduced into the public domain in order to be further characterized and to extend the range of characteristics covered.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and future work</title>
      <p>In this work we provide a study of metrics for the characterization of transactional databases commonly used for the benchmarking of itemset mining algorithms. We argue that these metrics are needed in order to assess the diversity and, therefore, the representativeness of such benchmarks, so that conclusions drawn from benchmarking studies are sufficiently supported. We study the representativeness of databases used as the basis for existing benchmarking studies found in the literature based on our characteristics, thus assessing which of the studies has better support for their claims. Finally, we propose a way of obtaining benchmarks with guaranteed diversity that authors of new benchmarking studies can benefit from.</p>
      <p>As future lines of research we are always on the
lookout for new metrics used for the characterization
of databases. As new metrics and databases are
incorporated into our lists, we should revise the minimum
representative benchmarks accordingly. This could be
thought of as an evolving process and as such could (and
should) potentially be included in the FIMI or similar
repository to be used by the data mining community.</p>
      <p>Finally, a line of research that we are pursuing is that of synthetic database generation. The existence of this paper is, in fact, due to the fact that we identified the need for characterizing databases so that we could assess in some way the datasets that we generate. With characterization metrics, we can check the nature of the databases that we generate (generally mimicking some original database that, due to privacy and ownership constraints, cannot be shown or used). For example, we could check whether the synthetic and original databases are similar enough, if that is what is desired. Or we could study how parameter tuning in the generative algorithms affects the nature of the generated databases. In any case, being able to characterize databases is a powerful and necessary tool to understand the nature of the generated datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>We thank the anonymous reviewers for their helpful
comments. Both authors have been partially supported
by TIN2017-89244-R from MINECO (Spain's
Ministerio de Economia, Industria y Competitividad) and
the recognition 2017SGR-856 (MACDA) from AGAUR
(Generalitat de Catalunya). Christian Lezcano is
supported by Paraguay's Foreign Postgraduate Scholarship
Programme Don Carlos Antonio Lopez (BECAL).
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[4] ZhiHong Deng, ZhongHui Wang, and JiaJian Jiang. A new algorithm for fast mining frequent itemsets using N-lists. Science China Information Sciences, 55(9):2008-2030, Sep 2012. ISSN 1869-1919.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Durand</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Quafafou</surname>
          </string-name>
          .
<article-title>Frequent Itemset Border Approximation by Dualization</article-title>
.
<source>Transactions on Large-Scale Data- and Knowledge-Centered Systems (TLDKS)</source>
,
<volume>26</volume>
:
<fpage>32</fpage>
-
<lpage>60</lpage>
, March
<year>2016</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Flouvat</surname>
          </string-name>
          ,
<string-name>
  <given-names>Fabien</given-names>
  <surname>De Marchi</surname>
</string-name>
, and
<string-name>
  <given-names>Jean-Marc</given-names>
  <surname>Petit</surname>
</string-name>
          .
          <article-title>A thorough experimental study of datasets for frequent itemsets</article-title>
          .
<source>In Data Mining, Fifth IEEE International Conference on</source>
, 8 pp.
<source>IEEE</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Flouvat</surname>
          </string-name>
          , Fabien De Marchi, and
          <string-name>
<given-names>Jean-Marc</given-names>
            <surname>Petit</surname>
          </string-name>
          .
<article-title>A new classification of datasets for frequent itemsets</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
):
<fpage>1</fpage>
-
<lpage>19</lpage>
, Feb
<year>2010</year>
. ISSN 1573-7675.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Fournier-Viger</surname>
          </string-name>
          ,
<string-name>
  <given-names>Jerry Chun-Wei</given-names>
  <surname>Lin</surname>
</string-name>
, Antonio Gomariz, Ted Gueniche, Azadeh Soltani, Zhihong Deng, and Hoang Thanh Lam.
<article-title>The SPMF open-source data mining library version 2</article-title>
          . In Bettina Berendt, Bjorn Bringmann, Elisa Fromont, Gemma Garriga, Pauli Miettinen, Nikolaj Tatti, and Volker Tresp, editors,
          <source>Machine Learning and Knowledge Discovery in Databases</source>
, pages
<fpage>36</fpage>
-
<lpage>40</lpage>
, Cham
          ,
          <year>2016</year>
          . Springer International Publishing.
          <source>ISBN 978-3-319-46131-1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Fournier-Viger</surname>
          </string-name>
          ,
<string-name>
  <given-names>Jerry Chun-Wei</given-names>
  <surname>Lin</surname>
</string-name>
          , Bay Vo, Tin Chi Truong, Ji Zhang, and Hoai Bac Le.
          <article-title>A survey of itemset mining</article-title>
          .
<source>Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
<ref id="ref7">
  <mixed-citation>
    [1]
    <string-name>
      <given-names>Rakesh</given-names>
      <surname>Agrawal</surname>
    </string-name>
    and
    <string-name>
      <given-names>Ramakrishnan</given-names>
      <surname>Srikant</surname>
    </string-name>
    .
    <article-title>Fast algorithms for mining association rules in large databases</article-title>
    .
    <source>In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94</source>
    , pages
    <fpage>487</fpage>
    -
    <lpage>499</lpage>
    , San Francisco, CA, USA,
    <year>1994</year>
    . Morgan Kaufmann Publishers Inc. ISBN 1-55860-153-8.
  </mixed-citation>
</ref>
<ref id="ref7b">
  <mixed-citation>
    [10]
    <string-name>
      <given-names>Bart</given-names>
      <surname>Goethals</surname>
    </string-name>
    .
    <article-title>Survey on frequent pattern mining</article-title>
    . Technical report, University of Helsinki,
    <year>2003</year>
    .
  </mixed-citation>
</ref>
<ref id="ref7c">
  <mixed-citation>
    [11]
    <string-name>
      <given-names>Bart</given-names>
      <surname>Goethals</surname>
    </string-name>
    and
    <string-name>
      <given-names>Mohammed Javeed</given-names>
      <surname>Zaki</surname>
    </string-name>
    .
    <article-title>Advances in frequent itemset mining implementations: Introduction to FIMI'03</article-title>
    .
    <source>In FIMI '03, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA</source>
    ,
    <year>2003</year>
    . URL http://ceur-ws.org/Vol-90/intro.pdf.
  </mixed-citation>
</ref>
<ref id="ref8">
  <mixed-citation>
    [2]
    <string-name>
      <given-names>Douglas</given-names>
      <surname>Burdick</surname>
    </string-name>
    , Manuel Calimlim, Jason Flannick,
    <string-name>
      <given-names>Johannes</given-names>
      <surname>Gehrke</surname>
    </string-name>
    , and
    <string-name>
      <given-names>Tomi</given-names>
      <surname>Yiu</surname>
    </string-name>
    .
    <article-title>MAFIA: A performance study of mining maximal frequent itemsets</article-title>
    .
    <source>In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations</source>
    ,
    <year>2003</year>
    .
  </mixed-citation>
</ref>
<ref id="ref9">
  <mixed-citation>
    [12]
    <string-name>
      <given-names>Karam</given-names>
      <surname>Gouda</surname>
    </string-name>
    and
    <string-name>
      <given-names>Mohammed Javeed</given-names>
      <surname>Zaki</surname>
    </string-name>
    .
    <article-title>Efficiently mining maximal frequent itemsets</article-title>
    .
    <source>In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM '01</source>
    , pages
    <fpage>163</fpage>
    -
    <lpage>170</lpage>
    , Washington, DC, USA,
    <year>2001</year>
    . IEEE Computer Society. ISBN 0-7695-1119-8.
  </mixed-citation>
</ref>
<ref id="ref9b">
  <mixed-citation>
    [21]
    <string-name>
      <given-names>Takeaki</given-names>
      <surname>Uno</surname>
    </string-name>
    , Tatsuya Asai, Yuzo Uchida, and
    <string-name>
      <given-names>Hiroki</given-names>
      <surname>Arimura</surname>
    </string-name>
    .
    <article-title>LCM: An efficient algorithm for enumerating frequent closed item sets</article-title>
    .
    <source>In FIMI</source>
    ,
    <year>2003</year>
    .
  </mixed-citation>
</ref>
      <ref id="ref10">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Fast algorithms for frequent itemset mining using FP-trees</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>17</volume>
          (
          <issue>10</issue>
):
<fpage>1347</fpage>
-
<lpage>1362</lpage>
, Oct
<year>2005</year>
. ISSN 1041-4347.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14]
<string-name>
  <given-names>Jiawei</given-names>
  <surname>Han</surname>
</string-name>
          , Jian Pei, Yiwen Yin, and
          <string-name>
            <given-names>Runying</given-names>
            <surname>Mao</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation: A frequent-pattern tree approach</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
):
<fpage>53</fpage>
-
<lpage>87</lpage>
, Jan
<year>2004</year>
. ISSN 1573-756X.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15]
<string-name>
  <given-names>Guimei</given-names>
  <surname>Liu</surname>
</string-name>
, Hongjun Lu, and Jeffrey Xu Yu.
<article-title>AFOPT: An efficient implementation of pattern growth approach</article-title>
.
<source>In Proceedings of the ICDM workshop</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Palmerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          .
          <article-title>Statistical properties of transactional databases</article-title>
          .
          <source>In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04</source>
          , pages
          <fpage>515</fpage>
-
<lpage>519</lpage>
          , New York, NY, USA,
          <year>2004</year>
          .
          <source>ACM. ISBN 1-58113-812-1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
-
<lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
<string-name>
  <given-names>Jian</given-names>
  <surname>Pei</surname>
</string-name>
          , Jiawei Han, Hongjun Lu, Shojiro Nishio,
          <string-name>
            <given-names>Shiwei</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dongqing</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>H-mine: hyperstructure mining of frequent patterns in large databases</article-title>
          .
          <source>In Proceedings 2001 IEEE International Conference on Data Mining</source>
          , pages
<fpage>441</fpage>
-
<lpage>448</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Ganesh</given-names>
            <surname>Ramesh</surname>
          </string-name>
, William A. Maniatty, and
<string-name>
  <given-names>Mohammed J.</given-names>
  <surname>Zaki</surname>
</string-name>
          .
          <article-title>Feasible itemset distributions in data mining: Theory and application</article-title>
          .
          <source>In Proceedings of the Twenty-second ACM SIGMODSIGACT-SIGART Symposium on Principles of Database Systems, PODS '03</source>
          , pages
          <fpage>284</fpage>
-
<lpage>295</lpage>
          , New York, NY, USA,
          <year>2003</year>
          .
          <source>ACM. ISBN 1-58113-670-6.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [20]
<string-name>
  <given-names>Ganesh</given-names>
  <surname>Ramesh</surname>
</string-name>
          ,
          <string-name>
            <given-names>Mohammed J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          , and
<string-name>
  <given-names>William A.</given-names>
  <surname>Maniatty</surname>
</string-name>
          .
          <article-title>Distribution-based synthetic database generation techniques for itemset mining</article-title>
          .
          <source>In Proceedings of the 9th International Database Engineering &amp; Application Symposium</source>
, IDEAS '05, pages
<fpage>307</fpage>
-
<lpage>316</lpage>
, Washington, DC, USA,
<year>2005</year>
. IEEE Computer Society. ISBN 0-7695-2404-4.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
<string-name>
  <given-names>Takeaki</given-names>
  <surname>Uno</surname>
</string-name>
          , Masashi Kiyomi, and
          <string-name>
            <given-names>Hiroki</given-names>
            <surname>Arimura</surname>
          </string-name>
          .
<article-title>LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets</article-title>
          .
          <source>In FIMI</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [23]
<string-name>
  <given-names>Mohammed J.</given-names>
  <surname>Zaki</surname>
</string-name>
          .
          <article-title>Scalable algorithms for association mining</article-title>
          .
          <source>IEEE Trans. on Knowl. and Data Eng</source>
          .,
          <volume>12</volume>
          (
          <issue>3</issue>
):
<fpage>372</fpage>
-
<lpage>390</lpage>
, May
<year>2000</year>
. ISSN 1041-4347. doi: 10.1109/69.846291.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [24]
<string-name>
  <given-names>Zijian</given-names>
  <surname>Zheng</surname>
</string-name>
          , Ron Kohavi, and
          <string-name>
            <given-names>Llew</given-names>
            <surname>Mason</surname>
          </string-name>
          .
          <article-title>Real world performance of association rule algorithms</article-title>
          .
          <source>In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01</source>
          , pages
          <fpage>401</fpage>
-
<lpage>406</lpage>
          , New York, NY, USA,
          <year>2001</year>
          . ACM. ISBN 1-58113-391-X.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Albrecht</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          .
          <article-title>The data problem in data mining</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
<fpage>38</fpage>
-
<lpage>45</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>