<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Symposium on Advanced Database Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Generating Synthetic Discrete Datasets with Machine Learning</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Manco</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ettore Ritacco</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonino Rullo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Saccà</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Serra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Boise State University</institution>
          ,
          <addr-line>Boise, ID 83725</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIMES Department, University of Calabria</institution>
          ,
          <addr-line>Rende, CS, 87036</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for High Performance Computing and Networking (ICAR) of the Italian National Research Council (CNR)</institution>
          ,
          <addr-line>v. P. Bucci 8/9C, 87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Real data are not always available, accessible, or sufficient, and in many cases they are incomplete and lack the semantic content necessary for the definition of optimization processes. In this paper we discuss synthetic data generation from two different perspectives. The core common idea is to analyze a limited set of real data, learn the main patterns that characterize them, and exploit this knowledge to generate brand new data. The first perspective is constraint-based generation and consists in generating a synthetic dataset satisfying given support constraints on the real frequent patterns. The second one is based on probabilistic generative modeling and considers synthetic generation as a sampling process from a parametric distribution learned on the real data, typically encoded as a neural network (e.g., Variational Autoencoders, Generative Adversarial Networks).</p>
      </abstract>
      <kwd-group>
        <kwd>Synthetic dataset</kwd>
        <kwd>Data generation</kwd>
        <kwd>Inverse Frequent Itemset Mining</kwd>
        <kwd>Constraints-based models</kwd>
        <kwd>Variational Autoencoder</kwd>
        <kwd>Generative Adversarial Networks</kwd>
        <kwd>Generative models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>
        This paper is an extended abstract of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Emerging “Big Data” platforms and applications call for the invention of novel data analysis
techniques that are capable to efectively and eficiently handle large amount of data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There
is therefore an increasing need to use real-life datasets for data-driven experiments but the
scarcity of significant datasets is a critical issue for research papers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Synthetic data generation
can help in this, by reproducing the internal mechanisms and dependencies that justify the
occurrence of some specific pieces of information, and hence being able to replicate them
stochastically. In this paper we focus on the problem of generating high-dimensional discrete
data. Generating such data is a challenge because of the combination of both high dimensionality
and discrete components, which may result in a complex structural domain with lots of variety
and irregularities, not necessarily smooth. Throughout the paper, we shall study approaches to
data generation which rely on the idea that each point in the real domain can be mapped into a
suitable latent space and vice versa. These mappings guarantee a consistency with regards to
the original dataset. At the same time, the manifold in the latent space summarizes the main
characteristics of the data, that can hence be injected into the synthesized data in a controlled
way. We first discuss two main approaches, that are the Inverse Frequent Itemset Mining ( IFM)
and probabilistic generative modeling (PGM); finally, a comparison of the results obtained with
some of the presented algorithms are provided.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Inverse Frequent Itemset Mining-based Generative Models</title>
      <p>
        Let ℐ be a finite domain of n elements, also called items. Any subset I ⊆ ℐ is an itemset over
ℐ, also called a transaction. Let 𝒫(ℐ) denote the set of all itemsets on ℐ; then, |𝒫(ℐ)| = 2^n. A
(transactional) database 𝒟 is a set of tuples [k, I], where k is the key and I is an itemset. The size
|𝒟| of 𝒟 is the total number of its itemsets, i.e., transactions. A transactional database 𝒟 is very
often represented as a bag of itemsets, i.e., the keys are omitted so that tuples are simply itemsets
and may therefore occur duplicated – in this case 𝒟 is also called a transactional dataset. In
the paper we shall also represent an itemset I ∈ 𝒟 by its one-hot encoding x_I, that is a binary
vector of size n (the number of items in ℐ) such that its j-th position x_I,j = 1 if the j-th item in ℐ
is in I, and 0 otherwise. Consequently, X = {x_1, . . . , x_N}, where N = |𝒟|, is the one-hot encoding
of the whole dataset 𝒟. For each itemset I ∈ 𝒟, there exist two important measures: (i) the
number of duplicates of I, denoted as δ(I), that is the number of occurrences of I in 𝒟, and (ii)
the support of I, denoted as σ(I), that is the sum of the numbers of duplicates of the itemsets
J in 𝒟 containing I, i.e., σ(I) = ∑_{J ∈ 𝒟 ∧ I ⊆ J} δ(J). A dataset 𝒟 can be represented in a
succinct format as a set of pairs (I, δ(I)). Given ℐ = {a, b, c, d}, an example of dataset in the
succinct, one-hot format is shown in Table 1(a). We say that I is a frequent (resp., infrequent)
itemset in 𝒟 if its support is greater than or equal to (resp., less than) a given threshold. A
classical data mining task over transactional datasets is to detect the set of frequent/infrequent
itemsets, and a rich literature deals with this topic [
        <xref ref-type="bibr" rid="ref4">4, 5, 6</xref>
        ]. Given the threshold 50, the frequent
itemsets for the dataset of Table 1(a) are listed in Table 1(b). The perspective of the frequent
itemset mining problem has later been inverted as follows: given a set of itemsets together with
their frequency constraints, the goal is to compute, if any, a transactional dataset satisfying the
above constraints. This new problem is called the inverse frequent itemset mining problem (IFM)
[7], and has later been investigated also in privacy-preserving contexts [8]. Given
a set ℐ of items, the IFM problem consists in finding a dataset 𝒟 that satisfies given support
constraints on some itemsets I on ℐ (the set of such itemsets is denoted by S). The support
constraints are represented as follows: ∀ I ∈ S : σ_min(I) ≤ σ(I) ≤ σ_max(I), where σ(I) is
the sum of the numbers of duplicates of the itemsets in 𝒟 containing I. As an example, consider
ℐ = {a, b, c, d}, S = {{a, b}, {b, c}, {c, d}} and the support constraints represented in Table
1(c); in this example minimal and maximal supports coincide. The itemsets I1 = {a, b} and
I2 = {b, c} must occur in exactly 100 transactions (possibly as their sub-transactions), whereas
the itemset I3 = {c, d} must occur in exactly 50 transactions. It is also required that the dataset
size (i.e., the total number of transactions) be 170. The dataset 𝒟1 shown in Table 2(a) is feasible
as it satisfies all constraints: I1 is satisfied by the transactions {a, b, c} and {a, b}, I2 by the
transactions {a, b, c}, {b, c, d} and {b, c}, and I3 by the transactions {b, c, d} and {c, d}.
      </p>
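      <p>As a concrete illustration of these definitions, the following sketch (plain Python) computes δ(I) and σ(I) over a dataset in succinct format; the duplicate counts below are one hypothetical choice consistent with the constraints of the running example, since Table 2(a) is not reproduced here.</p>
      <preformat><![CDATA[
# Succinct dataset: delta(I) for each distinct transaction I (a feasible D1).
D1 = {
    frozenset("abc"): 70,
    frozenset("ab"):  30,
    frozenset("bcd"): 10,
    frozenset("bc"):  20,
    frozenset("cd"):  40,
}

def duplicates(I, D):
    """delta(I): number of occurrences of I as a transaction of D."""
    return D.get(frozenset(I), 0)

def support(I, D):
    """sigma(I): sum of duplicates of every transaction J of D with I a subset of J."""
    I = frozenset(I)
    return sum(d for J, d in D.items() if I <= J)

print(support("ab", D1), support("bc", D1), support("cd", D1))  # 100 100 50
print(support("abc", D1))  # 70: the spurious frequent itemset discussed below
print(sum(D1.values()))    # dataset size |D1| = 170
]]></preformat>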
      <p>Let S′ be the set of all itemsets that are neither in S nor subsets of some itemset in S. In
the example, S′ consists of {a, b, c, d}, {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}, {a, c}, {a, d} and
{b, d}. IFM does not enforce any constraint on the itemsets in S′ and, therefore, it may happen
that 𝒟 contains additional (and, in some cases, unsuspected or even undesired) frequent itemsets.
In the dataset 𝒟1 of Table 2(a), the itemset {a, b, c} is in S′ but it turns out to be frequent with a
support of 70. To remove the anomaly, Guzzo et al. [9] have proposed an alternative formulation,
called IFM_S, that requires that only itemsets in S can be included as transactions in 𝒟 so that,
therefore, no unexpected frequent itemsets may eventually occur. Obviously, the decision
complexity of this problem is lower, as it is NP-complete. Despite the complexity improvement,
the IFM_S formulation has a severe drawback: it is too restrictive in excluding any transaction
besides the ones in S, as confirmed by the fact that no feasible dataset exists for our running
example. To weaken the tight restrictions of IFM_S, Guzzo et al. [10] proposed a new formulation
of the problem, called IFM with infrequency support constraints (IFM_I for short), which admits
transactions in S′ in a feasible dataset if their supports are below a given threshold σ′. By
the anti-monotonicity property, the number of infrequency support constraints can be reduced
by applying them only to a subset of S′ consisting of its minimal (inclusion-wise) elements. This
subset, denoted by S′_min, is called the negative border and coincides with the set of all minimal
transversals of the hypergraph H̄ = {ℐ ∖ I : I ∈ S} (as defined in [11]). In the example,
S′_min = {{a, c}, {a, d}, {b, d}} and the dataset 𝒟2 in Table 2(b) is a feasible dataset for IFM_I for
σ′ = 40. In fact, all infrequency support constraints on S′_min are satisfied, as the supports of
{a, c}, {a, d}, {b, d} are respectively 40, 0 and 40. Another possibility to enforce infrequency
constraints is to fix a duplicate threshold δ′ so that an itemset in S′ is admitted as a transaction
in a feasible dataset if its number of occurrences is at most δ′. This formulation has been given
in [12] with the name of IFM with infrequency duplicate constraints (IFM_D for short). Observe
that duplicate constraints are less restrictive than infrequency constraints in the sense that
some itemset I in S′ may happen to be eventually frequent, as it may inherit the supports of
several itemsets in S′ with duplicates below the threshold. For instance, given the threshold
δ′ = 30, the dataset 𝒟3 in Table 2(c) is a feasible dataset for IFM_D. However, the supports of
{a, c}, {a, d}, {b, d} are respectively 60, 30 and 50, thus {a, c} and {b, d} are frequent.</p>
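      <p>The construction of S′ and of its negative border for the running example can be sketched as follows (plain Python; helper names are illustrative).</p>
      <preformat><![CDATA[
from itertools import combinations

ITEMS = "abcd"
S = [frozenset("ab"), frozenset("bc"), frozenset("cd")]

def powerset(items):
    """All non-empty itemsets over the given items."""
    return (frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r))

# S': itemsets over ITEMS that are neither in S nor subsets of a member of S.
S_prime = [I for I in powerset(ITEMS) if not any(I <= J for J in S)]

# Negative border: the inclusion-wise minimal elements of S'.
border = [I for I in S_prime if not any(J < I for J in S_prime)]

print(sorted("".join(sorted(I)) for I in border))  # ['ac', 'ad', 'bd']
]]></preformat>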
    </sec>
    <sec id="sec-4">
      <title>3. Machine Learning-based Generative Models</title>
      <p>The probabilistic approach to modeling transactional data (PGM, probabilistic generative modeling)
assumes that in a database 𝒟 the itemsets are modeled as stochastic events: that is, they are
sampled from an unknown true distribution P. The analysis of the statistical distribution of these
stochastic events provides insights into the mathematical rules governing the generation process.
The problem hence becomes how to obtain a smooth and reliable estimate of P.</p>
      <p>In general, it is convenient to use a parametric model to estimate P when the constraints on
the shape of the distribution are known. By associating each observation x with a probability
measure P(x | Θ) ≡ P_Θ(x), where Θ is the set of the distribution parameters, our problem
hence becomes to devise the optimal parameter set Θ that guarantees a reliable approximation
P_Θ ≈ P and that can emulate the sampling process x ∼ P_Θ in a tractable and reliable way. A
clear advantage of parametric approaches to data generation lies in the insights they can
provide into the data generation process: they allow us to detect the factors governing the
data, providing a meaningful explanation of complex phenomena. In this work, we focus
on two state-of-the-art probabilistic approaches: Variational Autoencoders and Generative
Adversarial Networks.</p>
      <p>A Variational Autoencoder (VAE) is an artificial neural architecture that combines
traditional autoencoder architectures [13] with the concept of latent variable modeling [14].
Essentially, we can assume the existence of a K-dimensional latent space 𝒵 that can be the
generation engine of the samples in X. The transactions X = {x_1, . . . , x_N} can be modeled
through a chain dependency: (i) given a prior distribution p(z) we can
sample z ∈ 𝒵, and (ii) given z and another distribution p_θ(x|z) (over the parameter set θ) we
can sample x. According to the maximum likelihood principle, optimal parameter values can
be found by trying to maximize p(X); unfortunately, this optimization is typically an
intractable problem that requires the exploitation of heuristics.</p>
      <p>Variational inference [15] introduces a proposal normal distribution q_φ(z|x) (over the
parameter set φ), whose purpose is to approximate the true posterior p(z|x). Hence, a VAE
is devised by concatenating two neural networks: an “Encoder”, that maps an input x into
a latent variable z by exploiting q_φ(z|x), and a “Decoder”, that reconstructs x by applying
p_θ(x|z) to z. The gain function, to be optimized to learn the network parameters θ and
φ, is obtained by marginalizing the log-likelihood of p(X) over z and applying Jensen’s
inequality: log p(x) ≥ E_{z∼q_φ}[log p_θ(x|z)] − KL[q_φ(z|x) ‖ p(z)], where KL is the
Kullback–Leibler divergence.</p>
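      <p>To make this objective concrete, the following is a minimal PyTorch sketch of the ELBO for one-hot/binary transactions, assuming a diagonal Gaussian encoder q_φ(z|x) and a multinomial decoder p_θ(x|z); layer sizes and names are illustrative assumptions, not the implementation evaluated in this paper.</p>
      <preformat><![CDATA[
import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, K = 4, 2  # item-space and latent dimensions (toy values)

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(n_items, 16)
        self.mu = nn.Linear(16, K)
        self.logvar = nn.Linear(16, K)
        self.dec = nn.Sequential(nn.Linear(K, 16), nn.ReLU(),
                                 nn.Linear(16, n_items))

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparametrization
        return self.dec(z), mu, logvar

def neg_elbo(logits, x, mu, logvar):
    # E_{z ~ q_phi}[log p_theta(x|z)]: multinomial log-likelihood of the items in x.
    log_px = (x * F.log_softmax(logits, dim=-1)).sum(-1)
    # KL[q_phi(z|x) || N(0, I)], in closed form for diagonal Gaussians.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return (kl - log_px).mean()

# Usage on a batch X of one-hot transactions:
#   logits, mu, logvar = vae(X); loss = neg_elbo(logits, X, mu, logvar); loss.backward()
]]></preformat>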
      <p>The Decoder can be opportunely exploited to solve the discrete data generation problem. In fact,
we can sample as many z as we want, feed them to the Decoder and obtain brand new synthetic
data that is similar to the real one, provided the VAE was properly trained. A simple generation approach,
called Multinomial VAE (MVAE), was proposed in [16], where x ∼ Multinomial(π(z)), with
π(z) = softmax{f_θ(z)} (f_θ being the Decoder network) and z ∼ 𝒩(0, I_K). A more sophisticated approach, that can
better exploit the latent space of z and overcome the strong bias of considering only one
standard normal distribution, is implemented by clustering all the z ∼ q_φ(z|x). The generation
of a synthetic x starts by choosing a cluster c through multinomial sampling according
to the clusters’ densities. Then, we can apply the MVAE sampling to pick z ∼ 𝒩(μ_c, σ_c),
where μ_c and σ_c are the mean and standard deviation within c. A comparison of these two
approaches is shown in Figure 1. Unfortunately, the approaches based on maximum likelihood
have been shown to suffer from over-generalization [17], especially when data from P is limited.</p>
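      <p>Both sampling schemes can be sketched as follows (PyTorch, reusing the vae, K and n_items of the previous sketch; the cluster weights, means and standard deviations are assumed to come from clustering the encoded training points, and the number of items drawn per transaction is an arbitrary toy choice).</p>
      <preformat><![CDATA[
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_mvae(vae, n, n_picks=3):
    z = torch.randn(n, K)                       # z ~ N(0, I_K)
    probs = F.softmax(vae.dec(z), dim=-1)       # pi(z)
    # x ~ Multinomial(pi(z)): draw item indices, then one-hot encode them.
    idx = torch.multinomial(probs, n_picks, replacement=False)
    return torch.zeros(n, n_items).scatter_(1, idx, 1.0)

@torch.no_grad()
def sample_clustered(vae, n, weights, mus, sigmas, n_picks=3):
    c = torch.multinomial(weights, n, replacement=True)  # cluster by density
    z = mus[c] + sigmas[c] * torch.randn(n, K)           # z ~ N(mu_c, sigma_c)
    probs = F.softmax(vae.dec(z), dim=-1)
    idx = torch.multinomial(probs, n_picks, replacement=False)
    return torch.zeros(n, n_items).scatter_(1, idx, 1.0)
]]></preformat>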
      <p>Generative Adversarial Networks (GANs) [18] propose an alternative modeling which
departs from maximum likelihood and instead focuses on an alternative optimization strategy.
In order to optimize the weights θ of a neural network, called Generator G, able to learn the
probability distribution P_θ, Adversarial Networks rely on an auxiliary classifier D, with weights φ,
trained to discriminate between real and generated data. In practice, optimality is achieved
when x ∼ P_θ is indistinguishable from x ∼ P. The training process can hence be devised as a
competitive game, with the generator trying to produce realistic samples, starting from random
samples in the latent space 𝒵, and the classifier focusing on the detection of generated data,
where the objective function is min_θ max_φ E_{x∼P}[log D(x)] + E_{x∼P_θ}[log(1 − D(x))].</p>
      <p>It can be shown [18] that the adoption of this alternate optimization is equivalent to training G
to minimize the Jensen–Shannon divergence, which, differently from the approaches based on
maximum likelihood, has the objective of a complete adherence of P and P_θ. In our context,
transactions are discrete vectors, with x ∈ {0, 1}^n, thus backpropagation does not directly
apply to G and a workaround is needed. The simplest one is to admit a continuous relaxation of
the output of the generator. Just like with the variational autoencoder, the output of G can be
modeled as a multinomial probability (with no direct sampling step), rather than a binary vector.
However, there is a major problem with this: the input of the discriminator would be a softmax
distribution for the generated transactions, and a binary vector for the real transactions. As
a result, the discriminator could easily tell them apart, with the result that the GAN would
get stuck in an equilibrium that is not good for the Generator. Different formulations of the
adversarial training (based on the Wasserstein distance [19], namely Wasserstein GANs or WGANs)
can partially mitigate this issue.</p>
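      <p>A minimal PyTorch sketch of this adversarial game on transactional data is given below, using the continuous relaxation just discussed (the generator emits a softmax row per transaction) and the common non-saturating variant of the generator update; architectures and names are illustrative assumptions.</p>
      <preformat><![CDATA[
import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, K = 4, 2  # toy sizes, as before

G = nn.Sequential(nn.Linear(K, 16), nn.ReLU(), nn.Linear(16, n_items))
D = nn.Sequential(nn.Linear(n_items, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    n = x_real.size(0)
    x_fake = F.softmax(G(torch.randn(n, K)), dim=-1)  # relaxed "transactions"
    # Discriminator step: max log D(x_real) + log(1 - D(x_fake)).
    d_loss = bce(D(x_real), torch.ones(n, 1)) + \
             bce(D(x_fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step (non-saturating variant): max log D(x_fake).
    g_loss = bce(D(x_fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    # Note: D sees softmax rows for fakes but binary rows for reals; this is
    # exactly the mismatch discussed above, which WGANs partially mitigate.
]]></preformat>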
    </sec>
    <sec id="sec-5">
      <title>4. Comparative Analysis</title>
      <p>In this section we compare the two approaches discussed in Sections 2 and 3 by analyzing their
behavior on a set of controlled experiments. For these, we use a toy dataset, shown in Table 3,
comprising 4 items upon which 10 patterns are selected. The dataset used in the experiments is
hence built by replicating such patterns with some fixed frequencies. We use the following
approaches to generate the synthetic datasets:
- IFM: IFM formulation with support ≥ 250.
- IFM_I: IFM formulation with infrequent itemsets’ support ≤ 210.
- IFM_D: IFM formulation with transaction duplicates ≤ 100.
- IFM_{I+D}: IFM_I merged with IFM_D.
- IFM−5%: IFM formulation imposing that each transaction has a number of duplicates
that differs by less than 5% from the number of duplicates in the original dataset.
- VAE: Variational Autoencoder generator.
- VAE−{t4, t6}: VAE without the generation of transactions t4 and t6 (as visible in Table 3),
by means of sampling with rejection (sketched after this list).
- IFM−5%−{t4, t6}: IFM−5% formulation imposing the number of duplicates for t4
and t6 to be zero.</p>
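      <p>The rejection step behind VAE−{t4, t6} can be sketched as follows (plain Python; generate stands for any sampler returning one-hot vectors, e.g. those sketched in Section 3, and the forbidden item indices are hypothetical placeholders for t4 and t6, whose actual content is in Table 3).</p>
      <preformat><![CDATA[
FORBIDDEN = {frozenset({0, 1}), frozenset({2, 3})}  # placeholders for t4, t6

def sample_with_rejection(generate, n):
    kept = []
    while len(kept) < n:
        for x in generate(n - len(kept)):
            items = frozenset(i for i, v in enumerate(x) if v)  # one-hot -> itemset
            if items not in FORBIDDEN:
                kept.append(x)
    return kept
]]></preformat>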
      <p>The purpose of the experiments is to observe the reconstruction process in both methods
and compare the resulting reconstructed datasets. The comparison relies on the transactions
(t1 . . . t10 in the table) as well as both simple items and item pairs. We evaluate the faithfulness
of the reconstruction in two respects: (1) whether the patterns are reproduced, and (2) whether
their frequencies are faithful. A simple metric to measure the reconstruction accuracy is the
discrepancy Δ, computed as Δ = (∑_p |r_p − o_p|) / (∑_p o_p), where, for a pattern p (either a transaction or an
item pair), r_p and o_p represent the frequency of the pattern in the reconstructed and original
dataset, respectively. Table 4 reports the values of discrepancy, which are further detailed in
Figures 3-5.</p>
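      <p>For clarity, the discrepancy can be computed with a few lines of Python (pattern names and frequencies below are illustrative):</p>
      <preformat><![CDATA[
def discrepancy(rec, orig):
    # Delta = sum_p |r_p - o_p| / sum_p o_p, over all observed patterns.
    patterns = set(rec) | set(orig)
    diff = sum(abs(rec.get(p, 0) - orig.get(p, 0)) for p in patterns)
    return diff / sum(orig.values())

print(discrepancy({"ab": 100, "cd": 50}, {"ab": 100, "cd": 50}))  # 0.0 (perfect)
print(discrepancy({"ab": 60, "cd": 50}, {"ab": 100, "cd": 50}))   # 40/150 = 0.267
]]></preformat>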
      <p>We first analyze the results of the reconstruction for a VAE-based generative model. Figure
3 shows a comparison with IFM−5%, the IFM-based formulation which provides the best
performance. The reconstruction provided by IFM−5% is extremely faithful, both on the
itemsets and the transactions. This is because it enforces that the number of duplicates of each
transaction differs by less than 5% from the number of duplicates in the original dataset. Figure 2
shows the details of how the original patterns are mapped into the latent generative space:
the leftmost picture shows the mapping of the original data, and the rightmost shows how
the generation results from a larger region of the latent space. The main advantage of the
approach based on generative modeling through latent variables is that the latent space allows us to
control the reconstruction process: by acting on it we can modify the characteristics of
the reconstructed space. For example, we see that transaction t11 (the only spurious transaction
generated by the VAE) is placed in a specific region, denoted by a red star in the figure. Sampling
repeatedly from that region would allow us to change the overall distribution of the transactions
while still maintaining the itemset distribution. By contrast, the IFM-based approaches are in
general successful in maintaining the itemset distributions. However, they tend to produce
higher noise on transactions unless explicitly constrained by the IFM−5% formulation
(see Figure 4). This noise can in principle be considered an advantage in specific contexts where
a differentiation from the original dataset is required (e.g., due to privacy concerns).</p>
      <p>In principle, the adoption of IFM allows implementing a reconstruction "by design", by
choosing which itemsets to maintain or suppress. As evidence, we report the cases of
IFM−5%−{t4, t6} and VAE−{t4, t6}, where transactions t4 and t6 are removed from the
generation phase. In fact, Figure 5 shows that IFM−5%−{t4, t6} keeps the number of
duplicates and the supports very close to the ones of the original dataset, while VAE−{t4, t6}
changes many of them to remove the two transactions. The approach based on generative
modeling is in general more efficient. However, constraint-based generation is sensitive to the
frequency threshold, and a suitable tuning can make these approaches comparable.</p>
      <p>To summarize, these experiments support an underlying intuition: constraint-based
generation allows more control over the expected outcome at the expense of a higher computational
cost, whereas probabilistic generative models provide more faithful reconstructions but are less
controllable. This essentially means that, without any further modeling artifact (which we do
not consider here), generative models are prone to fail in providing tailored reconstructions
where some patterns can be suppressed and new ones introduced. By contrast, constraint-based
generation is more suitable for reconstructions "by design".</p>
      <p>(Figure 5: duplicates comparison and support comparison, IFM vs. VAE w/o {t4, t6}.)</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>This paper has provided an overview of state-of-the-art approaches for synthetic
transactional data generation. A transaction has been modeled as a high-dimensional sparse itemset
that can be mapped into a binary vector defined over the item space. The investigated
algorithms are (variants of) Inverse Frequent Itemset Mining (IFM), and Probabilistic Generative
Models (PGMs). According to our analysis, the IFM approaches prove to be extremely flexible
and understandable; they enable control of the data generation procedure through the direct
identification of the discovered patterns to preserve. However, they proved to have extremely
onerous computational costs, making them unfeasible in high-dimensional contexts. The
opposite conclusion has been obtained by analyzing PGMs: they are extremely fast and accurate,
but strongly lacking in control, flexibility and understandability. As future work, an interesting
research line is to investigate novel methodologies and techniques that are able to take
advantage of both IFM and PGMs, by combining their strong points and mitigating their weaknesses.
Another promising research line is to apply the combination of the two approaches to NoSQL
applications by considering the extension of IFM that has been recently proposed in [12].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Manco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ritacco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saccà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <article-title>Machine learning methods for generating high dimensional discrete datasets</article-title>
          ,
          <source>WIREs Data Mining Knowl. Discov.</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1002/widm.1450. doi:10.1002/widm.1450.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Chen</surname>
          </string-name>
          , C.-Y. Zhang,
          <article-title>Data-intensive applications, challenges, techniques and technologies: A survey on big data</article-title>
          ,
          <source>Information Sciences 275</source>
          (
          <year>2014</year>
          )
          <fpage>314</fpage>
          -
          <lpage>347</lpage>
          . doi:10.1016/j.ins.2014.01.015.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Where's the data in the big data wave</article-title>
          ?,
          <year>2013</year>
          . ACM SIGMOD Blog: http://wp.sigmod.org/?p=786.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Imieliński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swami</surname>
          </string-name>
          ,
          <article-title>Mining association rules between sets of items in large databases</article-title>
          ,
          <source>in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93</source>
          , ACM, New York, NY, USA,
          <year>1993</year>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery 15 (2007) 55–86. doi:10.1007/s10618-006-0059-1.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Cagliero, P. Garza, Itemset generalization with cardinality-based constraints, Information Sciences 244 (2013) 161–174. doi:10.1016/j.ins.2013.05.008.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Mielikäinen, On inverse frequent set mining, in: Proceedings of the 2nd Workshop on Privacy Preserving Data Mining, PPDM ’03, IEEE Computer Society, Washington, DC, USA, 2003, pp. 18–23.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Agrawal, R. Srikant, Privacy-preserving data mining, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, ACM, New York, NY, USA, 2000, pp. 439–450. doi:10.1145/342009.335438.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Guzzo, D. Saccà, E. Serra, An effective approach to inverse frequent set mining, in: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 806–811. doi:10.1109/ICDM.2009.123.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Guzzo, L. Moccia, D. Saccà, E. Serra, Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs, ACM Transactions on Knowledge Discovery from Data 7 (2013) 18:1–18:39. doi:10.1145/2541268.2541271.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, Data mining, hypergraph transversals, and machine learning, in: Proceedings of the 16th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’97, 1997, pp. 209–216. doi:10.1145/263661.263684.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. Saccà, E. Serra, A. Rullo, Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications, Data Mining and Knowledge Discovery 33 (2019) 1736–1774. doi:10.1007/s10618-019-00643-1.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of the International Conference on Unsupervised and Transfer Learning Workshop (UTLW), volume 27, PMLR, 2011, pp. 37–49.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (2017) 859–877. doi:10.1080/01621459.2017.1285773.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] D. Liang, R. G. Krishnan, M. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, WWW ’18, 2018, pp. 689–698.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Theis, A. van den Oord, M. Bethge, A note on the evaluation of generative models, in: International Conference on Learning Representations, 2016. doi:10.48550/arXiv.1511.01844.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, volume 27, 2014.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 214–223.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>