<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MTCopula: Synthetic Complex Data Generation Using Copula</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fodil Benali, Damien Bodénès</string-name>
          <email>{fbenali,dbodenes}@adwanted.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Labroche, Cyril de Runz</string-name>
          <email>cyril.derunz@univ-tours.fr</email>
          <email>nicolas.labroche@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adwanted Group</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>BDTLN - LIFAT, University of Tours</institution>
          ,
          <addr-line>Blois</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, marketing strategies are data-driven, and their quality depends significantly on the quality and quantity of available data. As it is not always possible to access this data, there is a need for synthetic data generation. Most existing techniques work well for low-dimensional data and may fail to capture complex dependencies between data dimensions. Moreover, the tedious task of identifying the right combination of models and their respective parameters is still an open problem. In this paper, we present MTCopula, a novel approach for synthetic complex data generation based on Copula functions. MTCopula is a flexible and extendable solution that automatically chooses the best Copula model, between the Gaussian Copula and T-Copula models, and the best-fitted marginals to capture the data complexity. It relies on Maximum Likelihood Estimation to fit the candidate marginal distribution models and uses the Akaike Information Criterion to choose both the best marginals and the best Copula model, thus removing the need for a tedious manual exploration of their possible combinations. Comparisons with state-of-the-art synthetic data generators on a private real use case dataset, called AdWanted, and on datasets from the literature show that our approach better preserves the variable behaviors and the dependencies between variables in the generated synthetic datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Nowadays, data are the new gold. Unfortunately, it is difficult to
get this valuable data, as companies sometimes do not have the
means to collect large datasets relevant to their business. Others
have difficulty sharing sensitive data because of business
contract confidentiality or record privacy [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], which is the case of
ad planning, our industrial context. In this specific context, only
very few high-quality, complex data (multidimensional,
multivariate, categorical/continuous, time series, etc.), supposedly
representative of the whole dataset, are available for generating
a large and realistic synthetic dataset. Therefore, there is a true
need for a realistic complex data generator.
      </p>
      <p>
        Our objective is to generate new data that maintain the same
characteristics as the original data, such as the distribution of
attributes and the dependencies between them. Moreover, the new data must
resemble the original data structurally and formally, so that any
work done on the original data can be done using the synthetic
data [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. This cannot be done using the usual one-dimensional
synthetic data generation method [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] because, when applied
in a high-dimensional context, it cannot model the
dependency between variables. To tackle those issues, several
recent works have focused on deep learning approaches such as
Generative Adversarial Networks (GANs), but those approaches require
a large amount of data for the learning step and thus cannot be
used for our problem.
      </p>
      <p>
        Nevertheless, there has recently been a growing interest in
Copula-based models for estimating [
        <xref ref-type="bibr" rid="ref1 ref26">1, 26</xref>
        ] and sampling [
        <xref ref-type="bibr" rid="ref10 ref29">10, 29</xref>
        ]
from a multivariate distribution function. Copulas [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are joint
probability distributions in which any univariate continuous
probability distribution can be plugged in as a marginal. The
Copula captures the joint behavior of the variables and models
the dependence structure, whereas each marginal models the
individual behavior of its corresponding variable. Thus, our
problem turns into building a joint probability distribution that
best fits the marginal distribution of each variable and
captures the different dependencies between these variables. This
problem is often understood as a structure learning task that can
be solved in a constructive way while attempting to maximize
the likelihood or some information theory criterion [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>The Copula is a flexible mathematical tool that supports
different configurations in terms of marginal distributions and
Copula models. Choosing the best configuration is not simple.
For instance, the Copula-based data generators in the literature use
the Gaussian Copula model, but this model has difficulty
capturing tail dependencies, which may affect the quality of the data
generation.</p>
      <p>In this work, we present MTCopula, a flexible and extendable
Copula-based approach to model and generate complex data
(e.g., multivariate time series) with automatic optimization of
Copula configurations. Our contributions are the following: (1)
we formalize the problem of synthetic complex data generation,
(2) we propose an approach, MTCopula, to learn Copulas and
automatically choose the marginals and Copula models that best
fit the data we want to generate, and (3) we describe experiments
showing how well MTCopula preserves implicit relationships
between variables in the synthetic datasets on a real use case and
state-of-the-art datasets.</p>
      <p>This paper is organized as follows: Section 2 presents the
related works. Sections 3 and 4 introduce the main concepts
related to dependency structures and Copulas. Section 5 provides
the problem description, while Section 6 describes MTCopula,
our solution to model and generate data with their dependency
structures. Section 7 presents the experiments performed to
show the properties and the efficiency of our approach. Finally,
Section 8 concludes and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>The fundamental idea of synthetic data generation
is to sample data from a pre-trained statistical model and then
use the sampled data in place of the original data. In this section,
we study related works with regard to this preliminary notion
and our problem, which is the generation of synthetic complex
data. Complex data denotes a case where data can be a mixture
of continuous and categorical variables, in a high-dimensional
context, with possible temporal
relations in the order of variables (time series) and dependencies
in the tails of the variables’ distributions.</p>
      <p>First, our problem is not about generating data from
specifications: it is rather about generating synthetic data from real
data samples, which, for different reasons, are generally available
in small quantities but with good quality. Therefore, approaches
such as AutoUniv<sup>1</sup> cannot be applied.</p>
      <p>
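        Second, in the simplest case of one-dimensional synthetic data
generation, sampling from a random variable <inline-formula><tex-math>X</tex-math></inline-formula> with a known
probability distribution <inline-formula><tex-math>F</tex-math></inline-formula> is usually done using the classical
Inverse Transform Sampling (ITS) approach [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], in which
pseudorandom samples <inline-formula><tex-math>u_1, \ldots, u_n</tex-math></inline-formula> are generated from a uniform
distribution <inline-formula><tex-math>U</tex-math></inline-formula> on [0, 1] and then transformed by
<inline-formula><tex-math>F^{-1}(u_1), \ldots, F^{-1}(u_n)</tex-math></inline-formula>.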
      </p>
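      <p>As a minimal sketch of ITS, the following Python snippet draws samples from an exponential distribution, whose inverse distribution function has a closed form. The exponential target, the rate value, and the helper names are assumptions chosen for illustration only; they do not come from the approaches cited above.</p>
      <preformat>
```python
import math
import random


def exponential_inverse_cdf(u: float, rate: float) -> float:
    """Inverse CDF of the exponential distribution: -ln(1 - u) / rate."""
    return -math.log(1.0 - u) / rate


def inverse_transform_sample(inverse_cdf, n: int, rng: random.Random) -> list:
    """Draw n samples by pushing uniform draws on [0, 1) through inverse_cdf."""
    return [inverse_cdf(rng.random()) for _ in range(n)]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
rate = 2.0  # hypothetical rate parameter, chosen only for illustration
samples = inverse_transform_sample(
    lambda u: exponential_inverse_cdf(u, rate), 10_000, rng
)

# The sample mean should be close to the theoretical mean 1 / rate.
sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 3))
```
      </preformat>
      <p>Any other marginal with a computable inverse distribution function could be substituted for <monospace>exponential_inverse_cdf</monospace>. Applied column by column, however, this procedure samples each variable independently of the others.</p>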
      <p>The issue with applying such an approach in high dimensional
synthetic data generation is that it will not allow modeling the
dependency between variables. As a consequence, it generates an
independent joint distribution. Therefore, this approach cannot
capture the dependency structure, which is one of our problem’s
key elements.</p>
      <p>
        Then, traditionally, a perturbation technique called General
Additive Data Perturbation (GADP) has been widely used for
synthetic data generation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The principle consists in fitting a
multivariate Gaussian distribution on the input data, <inline-formula><tex-math>X \sim \mathcal{N}(\mu, \Sigma)</tex-math></inline-formula>.
After that, the estimated multivariate Gaussian variable <inline-formula><tex-math>X</tex-math></inline-formula> is used
to generate the synthetic data <inline-formula><tex-math>Y</tex-math></inline-formula> by adding a noise variable <inline-formula><tex-math>e</tex-math></inline-formula>,
<inline-formula><tex-math>Y = X + e</tex-math></inline-formula>, where <inline-formula><tex-math>e</tex-math></inline-formula> is a Gaussian error. The problem with this method
is that it cannot model the marginal behaviors
of variables well, since it considers only Gaussian marginal
distributions by construction, which can be limiting, as observed in our
experiments. Moreover, it does not model tail dependence,
as it considers only the correlation matrix <inline-formula><tex-math>\Sigma</tex-math></inline-formula>. Another variant of
GADP is the Dirichlet multivariate synthesizer based on MLE
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The problem with MLE for multivariate distribution fitting
is that the likelihood has to be maximized over a potentially high-dimensional
parameter space, which is computationally very expensive.
      </p>
      <p>
        The rise of deep learning in recent years has brought forth
new machine learning techniques such as generative adversarial
networks (GANs) [
        <xref ref-type="bibr" rid="ref18 ref23">18, 23</xref>
        ]. These techniques perform better than
state-of-the-art works in many fields but require large datasets for
training, which can be a significant problem because collecting
data is often expensive or time-consuming. Even when data is
already collected, this type of method cannot be applied due to
privacy or confidentiality issues. Moreover, GANs, like most
deep learning approaches, act as black boxes and do not allow a
business expert to understand how the synthetic data are actually
generated.
      </p>
      <p>
        Recently, there has been a growing interest in Copula-based
modeling and synthetic data generation. Although
Copula models can model both the dependencies and the marginal
behaviors of variables, most contributions on synthetic
data generation [
        <xref ref-type="bibr" rid="ref10 ref19">10, 19</xref>
        ] have focused on a single model: the
Gaussian Copula. However, this model assumes a dependency
structure that may only loosely capture the interaction between
variables [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], as it cannot model tail dependence.
In addition, these contributions use the Pearson correlation
coefficient to estimate the correlation matrix, which is not invariant
under strictly monotone non-linear transformations, although
this invariance is crucial in the Copula context. As a
consequence, the dependency structure is not fully preserved during
Copula fitting. Nevertheless, Copulas, with both their
marginal fits and their dependency structure, allow for a transparent
explanation of the generated data.
      </p>
      <p>In conclusion, Copulas seem to be the best solution for
generating datasets based on small, complex real datasets, but there is
a need for automating the parameter calibration. Before introducing
the Copula, we present the dependency structure notions in the
next section.</p>
      <p><sup>1</sup> https://archive.ics.uci.edu/ml/datasets/AutoUniv</p>
    </sec>
    <sec id="sec-3">
      <title>DEPENDENCY STRUCTURES</title>
      <p>One of our goals is to capture the dependency structure
relationship D between data/variables to finally be able to generate data
respecting those dependencies. This section focuses on the main
measures used to summarize dependency between components
of a random vector.</p>
    </sec>
    <sec id="sec-4">
      <title>Pearson Product–Moment Correlation</title>
      <p>
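        The Pearson product-moment correlation <inline-formula><tex-math>\rho</tex-math></inline-formula> is a measure of the
linear relationship between two random variables <inline-formula><tex-math>X_1, X_2</tex-math></inline-formula>. A
relationship is linear when a change in one variable is associated with
a proportional change in the other variable. The Pearson correlation
takes values in the interval [-1, 1], and it is defined as:
        <disp-formula><label>(1)</label><tex-math>\rho(X_1, X_2) = \mathrm{Corr}(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)}\,\sqrt{\mathrm{Var}(X_2)}}</tex-math></disp-formula>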
      </p>
      <p>
        The problem with the Pearson correlation is that it is not
invariant under non-linear strictly increasing transformations of
the marginals [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Rank Correlation</title>
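      <p>In practice, we often have a monotonic relationship between
measurements, in which variables tend to change together but not
necessarily at a constant rate. In this case, rank correlation statistics are
well suited for determining whether there is a correspondence
between random variables. We mention here the two important
rank correlation measures, namely Spearman’s <inline-formula><tex-math>\rho_S</tex-math></inline-formula> and Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula>.</p>
      <p>Definition 3.1 (Spearman’s <inline-formula><tex-math>\rho_S</tex-math></inline-formula> correlation). Let <inline-formula><tex-math>(X_1, X_2)</tex-math></inline-formula> be a
bivariate random vector with continuous marginal distribution functions <inline-formula><tex-math>F_1</tex-math></inline-formula> and <inline-formula><tex-math>F_2</tex-math></inline-formula>.
Spearman’s <inline-formula><tex-math>\rho_S</tex-math></inline-formula> is defined by:
        <disp-formula><label>(2)</label><tex-math>\rho_S(X_1, X_2) = \rho(F_1(X_1), F_2(X_2))</tex-math></disp-formula>
</p>
      <p>Definition 3.2 (Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> correlation). Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> is defined
as the probability of concordance minus the probability of
discordance of two random variables <inline-formula><tex-math>X_1</tex-math></inline-formula> and <inline-formula><tex-math>X_2</tex-math></inline-formula>:
        <disp-formula><label>(3)</label><tex-math>\tau(X_1, X_2) = P\big((X_{11} - X_{12})(X_{21} - X_{22}) &gt; 0\big) - P\big((X_{11} - X_{12})(X_{21} - X_{22}) &lt; 0\big)</tex-math></disp-formula>
where <inline-formula><tex-math>(X_{11}, X_{21})</tex-math></inline-formula> and <inline-formula><tex-math>(X_{12}, X_{22})</tex-math></inline-formula> are independent and identically
distributed copies of <inline-formula><tex-math>(X_1, X_2)</tex-math></inline-formula>.</p>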
      <p>
        Both Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> and Spearman’s <inline-formula><tex-math>\rho_S</tex-math></inline-formula> are invariant
with respect to strictly increasing transformations of the marginals. Their
range of values is the interval [-1, 1] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
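      <p>The difference between linear and rank correlation can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the data are simulated, and the rank-based estimators assume continuous data without ties. It applies a strictly increasing non-linear transformation to one variable and compares the three measures before and after.</p>
      <preformat>
```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks 1..n of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall's tau: (concordant - discordant pairs) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (0 > product)
    return score / (n * (n - 1) / 2)


rng = random.Random(1)  # fixed seed, simulated data for illustration only
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [a + rng.gauss(0.0, 0.5) for a in x]
x_t = [math.exp(3.0 * a) for a in x]  # strictly increasing, non-linear

print("pearson ", pearson(x, y), pearson(x_t, y))
print("spearman", spearman(x, y), spearman(x_t, y))
print("kendall ", kendall(x, y), kendall(x_t, y))
```
      </preformat>
      <p>Because the transformation leaves the ordering of the observations unchanged, the two rank-based values are identical before and after, whereas the Pearson value changes.</p>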
    </sec>
    <sec id="sec-6">
      <title>Tail Dependence</title>
      <p>
        Understanding the dependence structure of rare events is
fundamental in order to best model the behaviors of random variables.
Measures of dependence like the Pearson correlation, Spearman’s <inline-formula><tex-math>\rho_S</tex-math></inline-formula>,
and Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> are not able to correctly capture and
characterize the joint occurrence of large and small values of
random variables [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The Pearson correlation describes how
well two random variables are linearly correlated with respect to
their entire distribution. However, this information is not useful
to model the extreme behavior of two random variables [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>To evaluate tail dependence, the tail dependence coefficient is
calculated as follows:</p>
      <p>Definition 3.3 (Upper and lower tail dependence coefficient). The
upper tail dependence coefficient of a bivariate distribution is
defined as:
        <disp-formula><label>(4)</label><tex-math>\lambda_U = \lim_{t \to 1^-} P\big(X_2 &gt; F_2^{-1}(t) \mid X_1 &gt; F_1^{-1}(t)\big)</tex-math></disp-formula>
The lower tail dependence coefficient is:
        <disp-formula><label>(5)</label><tex-math>\lambda_L = \lim_{t \to 0^+} P\big(X_2 \le F_2^{-1}(t) \mid X_1 \le F_1^{-1}(t)\big)</tex-math></disp-formula>
</p>
      <p>Using those definitions, we are now able to introduce the
Copula on which our approach is based.</p>
    </sec>
    <sec id="sec-7">
      <title>COPULA</title>
      <p>
        This section summarizes the Copula principles, as they
are the key to data generation that preserves dependencies.
A deeper explanation of Copulas can be found in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Copula Foundations</title>
      <p>
        Copula is a Latin term which means “link” or “bond”. In recent years, due
to its ability to capture the core of multivariate data distributions
and their dependencies, the Copula has been applied in a wide range of
areas such as econometric modeling [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and quantitative risk
management [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        This concept was first introduced in statistical modeling in
1959 by Sklar [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] to describe the function that “joins together”
one-dimensional distribution functions to form a multivariate
distribution function. It is based on Sklar’s theorem (Theorem 4.1).
      </p>
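      <p>
        Theorem 4.1 (Sklar’s theorem). Let <inline-formula><tex-math>(X_1, \ldots, X_i, \ldots, X_d)</tex-math></inline-formula> be a
d-dimensional random vector with joint distribution function <inline-formula><tex-math>F</tex-math></inline-formula>
and marginal distribution functions <inline-formula><tex-math>F_i</tex-math></inline-formula>, <inline-formula><tex-math>i = 1, \ldots, d</tex-math></inline-formula>. Then there exists
a d-copula <inline-formula><tex-math>C : [0, 1]^d \to [0, 1]</tex-math></inline-formula> such that, for all <inline-formula><tex-math>x = (x_1, \ldots, x_d)</tex-math></inline-formula> in <inline-formula><tex-math>\mathbb{R}^d</tex-math></inline-formula>, the joint
distribution function can be expressed as:
        <disp-formula><label>(6)</label><tex-math>F(x_1, \ldots, x_i, \ldots, x_d) = C(F_1(x_1), \ldots, F_i(x_i), \ldots, F_d(x_d))</tex-math></disp-formula>
with associated density function <inline-formula><tex-math>h</tex-math></inline-formula>, expressed as the
product of the copula density function <inline-formula><tex-math>c</tex-math></inline-formula> and the marginal densities <inline-formula><tex-math>f_i</tex-math></inline-formula>:
        <disp-formula><label>(7)</label><tex-math>h(x_1, \ldots, x_d) = c(F_1(x_1), \ldots, F_d(x_d)) \times \prod_{i=1}^{d} f_i(x_i)</tex-math></disp-formula>
      </p>
      <p>
        Conversely, the Copula <inline-formula><tex-math>C</tex-math></inline-formula> corresponding to a multivariate
distribution function <inline-formula><tex-math>F</tex-math></inline-formula> with marginal distribution functions <inline-formula><tex-math>F_i</tex-math></inline-formula>, for <inline-formula><tex-math>i = 1, \ldots, d</tex-math></inline-formula>,
can be expressed as:
        <disp-formula><label>(8)</label><tex-math>C(u_1, \ldots, u_d) = F\big(F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d)\big), \quad \forall (u_1, \ldots, u_d) \in [0, 1]^d</tex-math></disp-formula>
where <inline-formula><tex-math>u_i = F_i(x_i)</tex-math></inline-formula> and <inline-formula><tex-math>F_i^{-1}</tex-math></inline-formula> is the inverse of the marginal
distribution function of <inline-formula><tex-math>X_i</tex-math></inline-formula>.
      </p>
      <p>
        The first equation of Sklar’s theorem (Eq.6) describes the
role of the Copula function, which is to connect or couple the
marginal distribution functions <inline-formula><tex-math>F_1, \ldots, F_d</tex-math></inline-formula> to form the multivariate
distribution function <inline-formula><tex-math>F</tex-math></inline-formula>. This allows great flexibility in
constructing statistical models by considering, separately, the univariate
behavior of the components of a random vector and their
dependence properties as captured by a Copula. In particular, Copulas
can serve for modeling situations where a different distribution is
needed for each marginal, providing a valid substitute for several
classical multivariate distribution functions such as the Gaussian,
Laplace, Gamma, and Dirichlet distributions. This particularity is one
of the main advantages of the Copula concept, as explained by
Mikosch [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]: “[Copula] generate all multivariate distributions
with flexible marginals”.
      </p>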
      <p>
        Equation 8 describes the construction of the Copula that
captures and estimates the dependence between the standardized
variables [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A typical example of this construction is the Gaussian
Copula, which is obtained by taking <inline-formula><tex-math>F</tex-math></inline-formula> in (Eq.8) as the
multivariate standard Gaussian distribution function. This illustrates the founding principle
of the Copula: the dependence of the data can be modeled
independently from the marginals. It is thus possible to represent
different original distributions just by changing the marginal
distributions.
      </p>
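      <p>This separation between dependence and marginals can be illustrated with a short sketch. The code below is not the MTCopula implementation: it assumes the bivariate case, a hypothetical correlation of 0.8, and two arbitrarily chosen marginals (exponential and uniform). It draws correlated Gaussian pairs, maps them to the Copula scale with the standard normal distribution function, and then applies the inverse distribution function of each marginal.</p>
      <preformat>
```python
import math
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def sample_gaussian_copula(rho: float, n: int, rng: random.Random) -> list:
    """Return n pairs (u1, u2) drawn from a bivariate Gaussian copula."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal variable with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((STANDARD_NORMAL.cdf(z1), STANDARD_NORMAL.cdf(z2)))
    return pairs


def exponential_inverse_cdf(u: float, rate: float) -> float:
    """Inverse CDF of an exponential marginal (hypothetical choice)."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0) at the upper boundary
    return -math.log(1.0 - u) / rate


def uniform_inverse_cdf(u: float, low: float, high: float) -> float:
    """Inverse CDF of a uniform marginal on [low, high] (hypothetical choice)."""
    return low + u * (high - low)


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
copula_sample = sample_gaussian_copula(rho=0.8, n=2000, rng=rng)

# Same dependence structure, two different marginals.
synthetic = [
    (exponential_inverse_cdf(u1, 1.5), uniform_inverse_cdf(u2, 10.0, 20.0))
    for u1, u2 in copula_sample
]
print(synthetic[:3])
```
      </preformat>
      <p>Swapping either inverse distribution function for another one changes the marginal behavior of the corresponding variable while the dependence carried by <monospace>copula_sample</monospace> stays the same.</p>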
      <p>Real-world high-dimensional data may have different marginals
and joint distributions. Therefore, Copulas seem to be the right
tools to overcome these difficulties.</p>
    </sec>
    <sec id="sec-9">
      <title>The Invariance Principle Of Copula</title>
      <p>Here, we would like to mention one of the principal properties
of Copulas, inferred from Sklar’s theorem (Theorem 4.1). This property is
central for data generation using Copulas, as it guarantees that
the normalization applied to the marginals by their respective
cumulative distribution functions <inline-formula><tex-math>F_i</tex-math></inline-formula> does not alter the measure of
dependence between the variables that we want to capture with
the Copula.</p>
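      <p>Theorem 4.2 (Invariance Principle of Copula). Let <inline-formula><tex-math>X = (X_1, \ldots, X_i, \ldots, X_d)</tex-math></inline-formula>
be a d-dimensional random vector with continuous
joint distribution <inline-formula><tex-math>F</tex-math></inline-formula>, marginal distribution functions <inline-formula><tex-math>F_i</tex-math></inline-formula>, <inline-formula><tex-math>i = 1, \ldots, d</tex-math></inline-formula>,
and a copula <inline-formula><tex-math>C</tex-math></inline-formula>. Let <inline-formula><tex-math>T_1, \ldots, T_d</tex-math></inline-formula> be strictly increasing transformations
on the ranges of <inline-formula><tex-math>X_1, \ldots, X_d</tex-math></inline-formula>, respectively. Then <inline-formula><tex-math>C</tex-math></inline-formula> is also the copula of the
random vector <inline-formula><tex-math>(T_1(X_1), \ldots, T_i(X_i), \ldots, T_d(X_d))</tex-math></inline-formula>.</p>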
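      <p>Thus, Copulas, which describe the dependence between the
components of a random vector, are invariant under increasing
transformations of each variable. The power of this theorem manifests
itself when moving from the multivariate distribution function
(<inline-formula><tex-math>F</tex-math></inline-formula>) to the corresponding random vector (<inline-formula><tex-math>X</tex-math></inline-formula>), in particular when
we want to sample from a multivariate distribution function. It
guarantees that dependency is preserved when
standardizing variables with their marginal distributions in order
to capture dependency, by taking <inline-formula><tex-math>T_i = F_i</tex-math></inline-formula> (cumulative
distribution functions <inline-formula><tex-math>F_i</tex-math></inline-formula> are strictly increasing by construction). After
that, in order to return to the original data shape, we apply the
inverse distribution function <inline-formula><tex-math>F_i^{-1}</tex-math></inline-formula> (or the quasi-inverse) by taking <inline-formula><tex-math>T_i = F_i^{-1}</tex-math></inline-formula>,
which is a strictly increasing transformation on the range of <inline-formula><tex-math>F_i</tex-math></inline-formula>.</p>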
    </sec>
    <sec id="sec-10">
      <title>Families Of Copulas</title>
      <p>
        In practice, there are many bivariate Copula families, such as the
elliptical Copulas, Archimedean Copulas, and extreme-value
Copulas [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but only a few multivariate ones. This section focuses on
the elliptical family because it contains two multivariate Copulas:
the Gaussian Copula and the T-Copula.
      </p>
      <p>4.3.1 Multivariate Gaussian Copula.</p>
      <p>Definition 4.3 (Multivariate Gaussian Copula). The
multivariate Gaussian Copula is the result of applying the inverse
statement of Sklar’s theorem (Eq.8) to the multivariate Gaussian
distribution with zero mean vector and correlation matrix <inline-formula><tex-math>R</tex-math></inline-formula>.</p>
      <p>
        The main drawback of the Gaussian Copula is that it cannot
capture tail dependence. The upper and lower tail
dependence coefficients between two variables <inline-formula><tex-math>(X_i, X_j)</tex-math></inline-formula> with
correlation coefficient <inline-formula><tex-math>\rho_{ij}</tex-math></inline-formula> are the same and are given by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
        <disp-formula><label>(9)</label><tex-math>\lambda_U^{ij} = \lambda_L^{ij} = 0, \quad \rho_{ij} &lt; 1</tex-math></disp-formula>
      </p>
      <p>Figure 1: (a) Gaussian Copula. (b) T-Student Copula.</p>
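      <p>4.3.2 Multivariate T-Copula.</p>
      <p>
        In this case, considering (Eq.8), <inline-formula><tex-math>F</tex-math></inline-formula> corresponds to the
multivariate T-Student distribution function <inline-formula><tex-math>t(\cdot\,; R, \nu)</tex-math></inline-formula> with scale parameter matrix
<inline-formula><tex-math>R \in [-1, 1]^{d \times d}</tex-math></inline-formula> and <inline-formula><tex-math>\nu &gt; 0</tex-math></inline-formula> degrees of freedom. Furthermore, <inline-formula><tex-math>F_i^{-1}</tex-math></inline-formula> is the
inverse of the univariate standard Student cumulative distribution function <inline-formula><tex-math>t_\nu</tex-math></inline-formula>. The main
advantage of the T-Copula compared to the Gaussian Copula is
its ability to capture tail dependence among extreme values [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The upper tail dependence coefficient <inline-formula><tex-math>\lambda_U^{ij}</tex-math></inline-formula> between two
variables <inline-formula><tex-math>(X_i, X_j)</tex-math></inline-formula> is equal to the lower tail dependence coefficient
<inline-formula><tex-math>\lambda_L^{ij}</tex-math></inline-formula>, because the T-Copula is symmetric, and is given by:
        <disp-formula><label>(10)</label><tex-math>\lambda_U^{ij} = \lambda_L^{ij} = 2\, t_{\nu+1}\!\left(-\sqrt{\nu+1}\,\frac{\sqrt{1-\rho_{ij}}}{\sqrt{1+\rho_{ij}}}\right)</tex-math></disp-formula>
      </p>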
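      <p>4.3.3 Illustration. To compare the ability of the T-Copula and the Gaussian
Copula to capture tail dependence, Figure 1 shows two scatter
plots that represent a bivariate distribution constructed using the
two mentioned Copulas.</p>
      <p>
        One important common characteristic in this comparison is
that Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> of two random variables <inline-formula><tex-math>(X_i, X_j)</tex-math></inline-formula>
has the same form for both the T-Copula and the Gaussian
Copula, and it is defined by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]:
        <disp-formula><label>(11)</label><tex-math>\tau_{ij} = \frac{2}{\pi} \arcsin(\rho_{ij})</tex-math></disp-formula>
where <inline-formula><tex-math>\rho_{ij}</tex-math></inline-formula> is the Pearson correlation between the pair <inline-formula><tex-math>(X_i, X_j)</tex-math></inline-formula>.</p>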
      <p>As we can see from the lower left and upper right corners of
the two scatter plots, the constructed bivariate distributions behave
significantly differently in their bivariate tails, although
they have the same marginals and correlation coefficient. In fact, with
the Gaussian Copula (left scatter plot), there seems to be no strong
dependence in the lower left and upper right corners, while the
T-Copula with three degrees of freedom (right scatter plot) appears
to have more mass and more structure in the lower and upper
tails.</p>
    </sec>
    <sec id="sec-11">
      <title>Copula Learning</title>
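      <p>Estimating a Copula <inline-formula><tex-math>C</tex-math></inline-formula> as in (Eq.6) that belongs to a parametric
family of Copulas <inline-formula><tex-math>C_\theta</tex-math></inline-formula>, such as the Gaussian and T Copulas, consists in
estimating the vector <inline-formula><tex-math>\theta</tex-math></inline-formula> of unknown parameters. If the marginal
distributions <inline-formula><tex-math>F_1, \ldots, F_d</tex-math></inline-formula> are known, the following sample
represents independent, identically distributed (i.i.d.) random samples
of the Copula:
        <disp-formula><label>(12)</label><tex-math>u_k = (F_1(x_{k1}), \ldots, F_d(x_{kd})), \quad k \in \{1, \ldots, n\}</tex-math></disp-formula>
</p>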
      <p>Consequently, <inline-formula><tex-math>\theta</tex-math></inline-formula> could be estimated using data distribution
fitting techniques such as Maximum Likelihood Estimation (MLE).
However, in reality, the marginals of <inline-formula><tex-math>X</tex-math></inline-formula> are unknown. For this
reason, the marginals have to be estimated before <inline-formula><tex-math>\theta</tex-math></inline-formula> can be
estimated. The Copula learning process, schematized in Figure
2, is structured in two steps, Marginal Distribution Fitting and
Copula Fitting, which are described in the following.</p>
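      <p>
        4.4.1 Marginal Distribution Fitting. Modeling the marginal
distributions <inline-formula><tex-math>F_1, \ldots, F_d</tex-math></inline-formula> is commonly achieved in two ways [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
The first approach consists in fitting a parametric distribution to
each marginal, i.e., we assume <inline-formula><tex-math>X_i \sim F_i(\cdot\,; \theta_i)</tex-math></inline-formula>, where the parameter <inline-formula><tex-math>\theta_i</tex-math></inline-formula> is
commonly estimated by maximum likelihood:
        <disp-formula><label>(13)</label><tex-math>\hat{\theta}_i := \arg\max_{\theta_i} \prod_{k=1}^{n} f_i(x_{ki}; \theta_i), \quad i \in \{1, \ldots, d\}</tex-math></disp-formula>
      </p>
      <p>The associated marginal distribution function <inline-formula><tex-math>F_i</tex-math></inline-formula> is then
estimated by <inline-formula><tex-math>F_i(\cdot\,; \hat{\theta}_i)</tex-math></inline-formula>. The second approach consists of modeling
the marginals non-parametrically using the empirical distribution
function <inline-formula><tex-math>\hat{F}_i</tex-math></inline-formula> defined as:
        <disp-formula><label>(14)</label><tex-math>\hat{F}_i(x) = \frac{1}{n+1} \sum_{k=1}^{n} \mathbf{1}\{x_{ki} \le x\}, \quad \text{for all } x</tex-math></disp-formula>
</p>
      <p>4.4.2 Copula Fitting. In both previous cases, we end up with
data on the Copula scale (pseudo-Copula data), which are used to estimate the
parameters <inline-formula><tex-math>\theta</tex-math></inline-formula> of the chosen multivariate Copula family:
        <disp-formula><label>(15)</label><tex-math>(u_{k1}, \ldots, u_{kd}) = (\hat{F}_1(x_{k1}), \ldots, \hat{F}_d(x_{kd})), \quad \text{for all } k = 1, \ldots, n</tex-math></disp-formula>
</p>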
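      <p>For the non-parametric case, the mapping to the Copula scale can be sketched as follows. The function name and the sample values are hypothetical; the code simply counts, for each observation, how many sample values do not exceed it and divides by the sample size plus one, as in the rescaled empirical distribution function described above.</p>
      <preformat>
```python
from bisect import bisect_right


def pseudo_observations(column):
    """Map one variable to the copula scale with the rescaled empirical CDF.

    Each value x becomes (number of sample values not exceeding x) / (n + 1),
    which keeps every pseudo-observation strictly inside the unit interval.
    """
    n = len(column)
    ordered = sorted(column)
    # bisect_right counts the sample values that are not greater than x.
    return [bisect_right(ordered, x) / (n + 1) for x in column]


# Hypothetical values of one continuous variable.
audience = [12.0, 3.5, 7.2, 30.1, 7.9]
u = pseudo_observations(audience)
print([round(v, 3) for v in u])
```
      </preformat>
      <p>Dividing by the sample size plus one rather than by the sample size avoids pseudo-observations equal to 1, for which the Copula density may be unbounded.</p>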
      <p>
        As for the estimation of the marginal distribution parameters,
maximum likelihood estimation is commonly
used to estimate the parameter vector <inline-formula><tex-math>\theta</tex-math></inline-formula> of the Copula based
on the pseudo-Copula data. If parametric marginal models (Eq.13)
are used, the approach is called inference functions for margins
(IFM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; if the empirical distribution of (Eq.14) is applied,
we have a semi-parametric approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], also known as
Canonical MLE (CMLE). The likelihood function is given by:
        <disp-formula><label>(16)</label><tex-math>L(\theta \mid u_1, \ldots, u_k, \ldots, u_n) = \prod_{k=1}^{n} c(u_{k1}, \ldots, u_{ki}, \ldots, u_{kd} \mid \theta)</tex-math></disp-formula>
      </p>
      <p>
        The success of the first approach (IFM) depends on finding
appropriate parametric models for the marginals. If the marginals
are misspecified, the estimated parameter vector <inline-formula><tex-math>\theta</tex-math></inline-formula> will be biased
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Finally, another simple method, called the method of moments,
is based on the invariance of Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> under strictly
increasing transformations of the marginals. The method consists
of calculating Kendall’s <inline-formula><tex-math>\tau</tex-math></inline-formula> for each bivariate marginal of the
Copula and then using the relationship in (Eq.11) to infer an estimate of
the entire correlation matrix <inline-formula><tex-math>R</tex-math></inline-formula> of the considered elliptical Copula
(Gaussian or T) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
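      <p>A minimal sketch of this method of moments is given below. It assumes tie-free continuous data and uses hypothetical helper names and simulated data; the estimate of each correlation coefficient is obtained by inverting the relationship between Kendall’s tau and the correlation coefficient of an elliptical Copula, i.e., by taking the sine of tau times pi over two.</p>
      <preformat>
```python
import math
import random


def kendall_tau(x, y):
    """Kendall's tau for tie-free samples: (concordant - discordant) / pairs."""
    n = len(x)
    score = 0
    for a in range(n):
        for b in range(a + 1, n):
            product = (x[a] - x[b]) * (y[a] - y[b])
            score += (product > 0) - (0 > product)
    return score / (n * (n - 1) / 2)


def correlation_from_tau(tau):
    """Invert tau = (2 / pi) * arcsin(rho) to recover rho."""
    return math.sin(math.pi * tau / 2.0)


def moment_correlation_matrix(columns):
    """Estimate the correlation matrix pair by pair from Kendall's tau."""
    d = len(columns)
    matrix = [[1.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            rho = correlation_from_tau(kendall_tau(columns[i], columns[j]))
            matrix[i][j] = rho
            matrix[j][i] = rho
    return matrix


# Simulated data: two Gaussian variables with true correlation 0.6.
rng = random.Random(7)
true_rho = 0.6
x1 = [rng.gauss(0.0, 1.0) for _ in range(400)]
x2 = [true_rho * v + math.sqrt(1.0 - true_rho ** 2) * rng.gauss(0.0, 1.0) for v in x1]

estimate = moment_correlation_matrix([x1, x2])
print(round(estimate[0][1], 2))
```
      </preformat>
      <p>Since each entry is estimated separately, the resulting matrix is not guaranteed to be positive semi-definite in higher dimensions, so a correction step may be needed before using it as a Copula parameter.</p>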
      <p>
        In the case of the T-Copula, to estimate the remaining parameter
<inline-formula><tex-math>\nu</tex-math></inline-formula>, MLE is generally used with the correlation matrix held fixed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-12">
      <title>PROBLEM FORMULATION</title>
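      <p>Our objective is, given a set <inline-formula><tex-math>X</tex-math></inline-formula> of complex and representative
observations (e.g., media channels with their user targets and respective
daytime audiences), to generate a synthetic dataset <inline-formula><tex-math>X'</tex-math></inline-formula> that is
similar to the original dataset <inline-formula><tex-math>X</tex-math></inline-formula> with respect to the following properties:</p>
      <list list-type="bullet">
        <list-item><p>For each attribute (variable) in the dataset, the generated
values must be consistent with the distribution of the
variable.</p></list-item>
        <list-item><p>The dependence between variables must remain the same in
the new dataset.</p></list-item>
      </list>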
      <p>This objective can be reformulated as: automatically find the
statistical model that best fits the data generation process.
Therefore, using Copulas and according to Section 4.4, this can
be done by, first, estimating the marginal parameters and, second,
estimating the Copula distribution parameters. The fit will
almost never be exact, so the problem consists of determining the
model parameters that minimize the relative amount of lost
information.</p>
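      <p>The first of these two estimation steps can be sketched in a few lines, using the Akaike Information Criterion that the remainder of this section adopts as the measure of lost information. The candidate set (normal and exponential), the closed-form estimators, and all names below are assumptions made for illustration; they are not the candidate list or the code of MTCopula.</p>
      <preformat>
```python
import math
import random


def fit_normal(data):
    """Closed-form MLE of a normal density and its maximized log-likelihood."""
    n = len(data)
    mu = sum(data) / n
    variance = sum((v - mu) ** 2 for v in data) / n  # the MLE divides by n
    log_likelihood = -0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)
    return {"name": "normal", "params": (mu, math.sqrt(variance)),
            "log_likelihood": log_likelihood, "n_params": 2}


def fit_exponential(data):
    """Closed-form MLE of an exponential density (positive data only)."""
    n = len(data)
    rate = n / sum(data)
    log_likelihood = n * math.log(rate) - rate * sum(data)
    return {"name": "exponential", "params": (rate,),
            "log_likelihood": log_likelihood, "n_params": 1}


def aic(fit):
    """AIC = 2 * (number of parameters) - 2 * ln(maximized likelihood)."""
    return 2.0 * fit["n_params"] - 2.0 * fit["log_likelihood"]


def select_marginal(data, fitters):
    """Fit every candidate by MLE and keep the one with the lowest AIC."""
    return min((fit(data) for fit in fitters), key=aic)


rng = random.Random(3)  # simulated data, for illustration only
skewed = [rng.expovariate(0.5) for _ in range(500)]
symmetric = [rng.gauss(10.0, 1.0) for _ in range(500)]

candidates = [fit_normal, fit_exponential]
best_skewed = select_marginal(skewed, candidates)
best_symmetric = select_marginal(symmetric, candidates)
print(best_skewed["name"], best_symmetric["name"])
```
      </preformat>
      <p>The same pattern extends to the second step by replacing the marginal candidates with candidate Copula models and the univariate likelihood with the joint one.</p>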
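      <p>
        In the literature, the Akaike Information Criterion (AIC) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
is often used to this end, but not in the context of automatic
determination of the best marginals or Copula models for data
generation. Notably, AIC provides a trade-off between the
goodness of fit and the model’s simplicity by penalizing
proportionally to the number of parameters. This, in turn,
decreases the risk of overfitting and underfitting at the same
time. In what follows, we formulate our problem based on AIC
without loss of generality, as any other test could have been used,
such as the Kolmogorov-Smirnov test, which does not penalize
models with more parameters. Based on AIC, our synthetic data
generation problem becomes the following two-step
optimization problem:
      </p>
      <p>(1) Sampling values consistent with each variable’s behavior
consists in finding the corresponding marginal
distribution density function <inline-formula><tex-math>(f_i, \theta_i)</tex-math></inline-formula> that minimizes:
        <disp-formula><label>(17)</label><tex-math>\mathrm{AIC}_i = 2 p_i - 2 \ln\big(\hat{L}(\hat{\theta}_i \mid x_{1i}, \ldots, x_{ni})\big), \quad i = 1, \ldots, d</tex-math></disp-formula>
where <inline-formula><tex-math>\hat{L}(\hat{\theta}_i \mid x_{1i}, \ldots, x_{ni}) = \prod_{k=1}^{n} f_i(x_{ki} \mid \hat{\theta}_i)</tex-math></inline-formula> represents the
maximized likelihood function of a candidate marginal density
<inline-formula><tex-math>f_i</tex-math></inline-formula> with <inline-formula><tex-math>p_i</tex-math></inline-formula>-dimensional vector of parameters <inline-formula><tex-math>\hat{\theta}_i</tex-math></inline-formula> given by:
        <disp-formula><label>(18)</label><tex-math>\hat{\theta}_i = \arg\max_{\theta_i} \prod_{k=1}^{n} f_i(x_{ki}; \theta_i)</tex-math></disp-formula>
</p>
      <p>(2) Characterizing the joint dependency behavior of the variables
consists in finding the joint distribution density
(Copula parameters) <inline-formula><tex-math>(h, \theta)</tex-math></inline-formula> that minimizes:
        <disp-formula><label>(19)</label><tex-math>\mathrm{AIC} = 2 p - 2 \ln\big(\hat{L}(\hat{\theta} \mid x_1, \ldots, x_k, \ldots, x_n)\big)</tex-math></disp-formula>
where <inline-formula><tex-math>\hat{L}(\hat{\theta} \mid x_1, \ldots, x_k, \ldots, x_n) = \prod_{k=1}^{n} h(x_{k1}, \ldots, x_{ki}, \ldots, x_{kd} \mid \hat{\theta})</tex-math></inline-formula> is
the maximized likelihood of the model <inline-formula><tex-math>h</tex-math></inline-formula> with parameters <inline-formula><tex-math>\theta</tex-math></inline-formula>, and
<inline-formula><tex-math>p</tex-math></inline-formula> is the number of parameters. <inline-formula><tex-math>\hat{\theta}</tex-math></inline-formula> is given by:
        <disp-formula><label>(20)</label><tex-math>\hat{\theta} = \arg\max_{\theta} \prod_{k=1}^{n} h(x_{k1}, \ldots, x_{ki}, \ldots, x_{kd}; \theta)</tex-math></disp-formula>
</p>
    </sec>
    <sec id="sec-13">
      <title>SOLUTION DESCRIPTION</title>
      <p>This section illustrates the general problem and describes its
solution in the specific context of complex data generation with
multivariate time series paired with categorical variables, as found
in our problem of media channel data generation. Our system,
called MTCopula, is broken down into three steps: (1)
data preparation, (2) Copula model learning, and (3) synthetic
data generation. Notably, only step (1) is specific to our
problem, while steps (2) and (3) are entirely generic to any complex
synthetic data generation scenario.</p>
      <p>6.1 Data Preparation</p>
      <p>6.1.1 General Pipeline. The Copula, as a multivariate distribution
function, requires a continuous representation of independent
and identically distributed <inline-formula><tex-math>d</tex-math></inline-formula>-dimensional random variables. Due
to this requirement, the multiple multivariate time series in the
input must be preprocessed before learning the Copula model
that, in a next step, generates synthetic data. Figure 3 illustrates
the different steps of our data preparation process.</p>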
      <p>Our preprocessing includes data cleaning, which
consists of first removing missing values and normalizing the data
representation (e.g., lowercasing). Then, each column representation
of the multivariate time series data is converted into a row
representation of multiple time series. This changes the
observation structure and, as a consequence, removes
the dependence due to the time series nature, where an
observation at a given time depends on previous time slots. In our case, the
multivariate time series is defined by 6 time-dependent variables
– {Women, Men} × {13 − 34 years, 34 − 65 years, 65+ years} –
and two categorical features – the media channel and the day
of the week – as visible in the first table in Figure 3. Each
multivariate time series thus produces six time series, each paired with a vector of three
categorical variables (Target, Channel, and Day). As a result of the
preprocessing step, we have a set of independent and identically
distributed observations defined by a vector of continuous and
categorical (discrete) variables, as shown in Figure 3.</p>
      <p>6.1.2 Categorical variables encoding and Copulas. Categorical
data cannot be modeled directly by the Copula, so we propose to
replace them with continuous data. To this end, we consider two
options. The first option relies only on distribution-based
encoding, but it fails to model the dependence between the values
of a categorical variable.</p>
      <p>
        The second option first performs a one-hot
encoding to capture the dependence between values of the same categorical
variable. Applying this to the Target variable makes it possible to model the
multivariate dependence between the different values of this
variable (women 13-34, men 13-34, women 34-65, men 34-65, women
+65, men +65) and, as a consequence, to model the multivariate
time series behavior. The distribution-based encoding technique
is then used to transfer the discrete representation of the
categorical variable to a continuous representation in the range
[0, 1]. Figure 4 illustrates the distribution-based encoding technique
using the Truncated Gaussian. This process gives dense areas
at the center of each interval and ensures that the numbers are
well differentiated. This facilitates the inverse process (decoding):
given a value in [0, 1], we can identify the corresponding
category based on the interval that contains it. Once the categorical variables
are transformed, we have a set of observations of d-dimensional
continuous random variables (third table in Figure 3). This dataset is
the input of the next step, which estimates the copula
parameters.
      </p>
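      <p>The encoding and decoding steps described above can be sketched in Python as follows. This is a minimal illustration and not the MTCopula implementation: the function names (build_intervals, encode, decode) are hypothetical, and two details that the text does not specify are assumed, namely that each category receives a sub-interval of [0, 1] whose width equals its relative frequency, and that the truncated Gaussian has a standard deviation of one sixth of the interval width.</p>

```python
import random
from collections import Counter


def build_intervals(values):
    """Assign each category a sub-interval of [0, 1] (width = relative frequency, an assumption)."""
    counts = Counter(values)
    total = len(values)
    intervals, start = {}, 0.0
    # Most frequent categories first; ties are broken by name for determinism.
    for cat, cnt in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        end = start + cnt / total
        intervals[cat] = (start, end)
        start = end
    return intervals


def encode(cat, intervals, rng):
    """Draw from a Gaussian centred on the category interval, truncated to its bounds."""
    lo, hi = intervals[cat]
    mid, sd = (lo + hi) / 2, (hi - lo) / 6  # the sd choice is an assumption
    while True:  # rejection sampling keeps the draw inside [lo, hi]
        x = rng.gauss(mid, sd)
        if x >= lo and hi >= x:
            return x


def decode(x, intervals):
    """Return the category whose interval contains x."""
    for cat, (lo, hi) in intervals.items():
        if x >= lo and hi >= x:
            return cat
    raise ValueError("value outside the encoded range")


rng = random.Random(0)
days = ["mon", "tue", "mon", "wed", "mon", "tue"]
intervals = build_intervals(days)
codes = [encode(d, intervals, rng) for d in days]
decoded = [decode(c, intervals) for c in codes]
```

      <p>Because most of the mass sits near the center of each interval, small perturbations of an encoded value still decode to the same category.</p>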
    </sec>
    <sec id="sec-14">
      <title>Copula Model Learning</title>
      <p>As explained in Section 4.4, the Copula learning process is done
in two steps: the marginal distribution fitting and the Copula
fitting.</p>
      <p>6.2.1 Marginal distribution fitting. Our system proposes two
methods to estimate the marginal distributions. The first one is
non-parametric, via the empirical distribution, as described in (Eq. 14),
and the second one is parametric and uses MLE (Eq. 13).
Algorithm 1 presents the steps of MLE to fit the marginals and, most
importantly, AIC to automate the choice of the best marginal
distribution among a set of preselected distributions. Currently, we
choose, without loss of generality, among the following bounded
distributions: Truncated Gaussian, GaussianKDE (Kernel Density
Estimator), Beta, Truncated Exponential, and Uniform.</p>
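      <p>The selection loop described above (fit each candidate by MLE, then keep the one with the lowest AIC) can be sketched as follows. This is an illustration under stated assumptions rather than Algorithm 1 itself: to stay within the Python standard library it uses three candidates with closed-form maximum likelihood estimates (Gaussian, Uniform, Exponential) instead of the bounded distributions listed in the text, and all function names are hypothetical.</p>

```python
import math
import random


def fit_gaussian(xs):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # the MLE divides by n
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return {"mu": mu, "sigma": math.sqrt(var)}, loglik


def fit_uniform(xs):
    lo, hi = min(xs), max(xs)  # assumes the data are not constant
    return {"low": lo, "high": hi}, -len(xs) * math.log(hi - lo)


def fit_exponential(xs):
    if 0 > min(xs) or sum(xs) == 0:  # support is [0, inf): reject otherwise
        return {"rate": float("nan")}, float("-inf")
    rate = len(xs) / sum(xs)
    return {"rate": rate}, len(xs) * (math.log(rate) - 1)


CANDIDATES = {
    "gaussian": fit_gaussian,
    "uniform": fit_uniform,
    "exponential": fit_exponential,
}


def select_marginal(xs, candidates=CANDIDATES):
    """Return (name, params, aic) of the candidate with the lowest AIC = 2k - 2 ln L."""
    best = None
    for name, fit in candidates.items():
        params, loglik = fit(xs)
        aic = 2 * len(params) - 2 * loglik
        if best is None or best[2] > aic:
            best = (name, params, aic)
    return best


rng = random.Random(42)
bell_shaped = [rng.gauss(0.5, 0.1) for _ in range(2000)]
flat = [rng.random() for _ in range(2000)]
print(select_marginal(bell_shaped)[0], select_marginal(flat)[0])
```

      <p>Adding a candidate only requires a function that returns the fitted parameters and the maximized log-likelihood, which is what makes this kind of selection easy to extend.</p>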
      <p>The estimated marginal distributions are used to construct
pseudo-Copula observations via the probability integral
transformation as described in (Eq. 15). A model selection criterion,
such as AIC, is used to select the copula C that best fits the
pseudo-Copula data and characterizes the dependence between marginals.
Algorithm 2 presents the steps of Copula fitting using AIC.</p>
      <p>6.2.2 Copula fitting. Most works on Copula-based synthetic
data generation use a Gaussian copula with an
MLE approach to estimate marginals. Our system gives
flexibility in terms of Copula model choice based on AIC, which, in
turn, makes it possible to learn different Copula models and to choose the
model that best fits the input data. For the moment, we fit
two models, the Gaussian Copula and the Student’s t Copula (T-Copula), as they are able to
capture different dependence structures: linear dependence, such as
correlation, with the Gaussian Copula, and non-linear behavior, such as tail
dependency, with the T-Copula.</p>
      <p>
        Interestingly, our work addresses a recurrent problem
observed when using Copulas: most contributions use a Gaussian
copula paired with a Pearson correlation [
        <xref ref-type="bibr" rid="ref10 ref19">10, 19</xref>
        ] in order to
estimate the correlation factor of the Gaussian Copula.
However, the Pearson correlation factor is not invariant under strictly
monotone non-linear transformations, which may impact the
estimation process when standardizing with marginal distribution
functions. Our contribution MTCopula uses Kendall’s τ
inversion, which is based on the relationship between the Elliptical
Copula (T-Copula or Gaussian Copula) correlation parameter
and the Kendall’s τ of two random variables (see Eq. 11). For the
T-Copula, another step is required to estimate the degrees of
freedom, which is based on MLE with the correlation matrix held
fixed.
      </p>
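      <p>For elliptical copulas, the standard relationship between the correlation parameter ρ and Kendall’s τ is τ = (2/π) arcsin(ρ), so the inversion is ρ = sin(πτ/2). The sketch below illustrates this inversion and the rank invariance that motivates it. It is not the MTCopula code: the function names are hypothetical, and the simple pair-counting estimator assumes data without ties.</p>

```python
import math
import random


def kendall_tau(xs, ys):
    """Kendall's tau by direct pair counting (O(n^2), assumes no ties)."""
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (0 > prod)  # +1 if concordant, -1 if discordant
    return 2 * s / (n * (n - 1))


def tau_to_rho(tau):
    """Invert tau = (2 / pi) * arcsin(rho), valid for elliptical copulas."""
    return math.sin(math.pi * tau / 2)


rng = random.Random(3)
rho_true = 0.6
xs, ys = [], []
for _ in range(400):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(z1)
    ys.append(rho_true * z1 + math.sqrt(1 - rho_true ** 2) * z2)

tau = kendall_tau(xs, ys)
# A strictly increasing transform of one variable leaves the ranks, hence tau, unchanged.
tau_after = kendall_tau([math.exp(x) for x in xs], ys)
rho_hat = tau_to_rho(tau)
print(tau, tau_after, rho_hat)
```

      <p>The Pearson coefficient computed on the exponentiated values would differ from the one computed on the raw values, whereas the τ-based estimate is identical in both cases.</p>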
    </sec>
    <sec id="sec-15">
      <title>Data Generation And Reconstruction</title>
      <p>For synthetic data generation, copula samples are generated by
sampling from the Copula density function c that corresponds
to the estimated Copula joint distribution function C. Then, the
inverse probability transformation (F<sup>−1</sup>) is applied to transform
the Copula samples back to the natural distribution of the data
(see Eq. 8). Algorithm 3 presents the steps to sample based on
Copula C and fitted marginal distributions (F<sub>1</sub>, F<sub>2</sub>, ..., F<sub>d</sub>).</p>
      <sec id="sec-15-1">
        <title>Algorithm 2: Copula Fitting with AIC.</title>
        <p>Input: Dataset X of n observations from a d-dimensional vector,
a method m (e.g., Kendall’s τ inversion) for parameter
estimation, and marginal distributions F<sub>1</sub>, ..., F<sub>d</sub>.</p>
        <p>Output: the best fitted copula C with estimated parameters θ.</p>
        <preformat>copulas = { Gaussian Copula, T-Copula };
best_aic = +∞;
copula_data = standardize(X, F_1, ..., F_d);
for copula in copulas do
    fitted_params = fit(copula, copula_data, method=m);
    aic = AIC(copula, fitted_params);
    if aic ≤ best_aic then
        best_aic = aic;
        θ = fitted_params;
        C = copula;
    end
end</preformat>
      </sec>
      <sec id="sec-15-2">
        <title>Algorithm 3: Sampling Based On Copula</title>
        <p>Input: Best fitted Copula C with parameter vector θ, fitted
marginal distributions (F<sub>1</sub>, θ̂<sub>1</sub>), (F<sub>2</sub>, θ̂<sub>2</sub>), ..., (F<sub>d</sub>, θ̂<sub>d</sub>).</p>
        <p>Output: synthetic d-dimensional observation x̃.</p>
        <preformat>Sample d-dimensional copula data u = (u_1, ..., u_d), u ∼ C(c, θ);
Return x̃ = (F_1^{-1}(u_1, θ̂_1), F_2^{-1}(u_2, θ̂_2), ..., F_d^{-1}(u_d, θ̂_d));</preformat>
        <p>
          For the moment, our system MTCopula supports two
Copula models: Gaussian and T-Copula. For generating correlated
random variables, our method uses the Cholesky factorization,
which is commonly used in Monte Carlo simulation to produce
efficient estimates of simulated values [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
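        <p>The sampling procedure can be sketched for the Gaussian case as follows: independent standard normal draws are correlated with the Cholesky factor of the correlation matrix, mapped to (0, 1) with the standard normal CDF, and then passed through the inverse marginal CDFs. This is a minimal standard-library illustration, not the MTCopula implementation. The function names and the two example marginals (an Exponential with rate 2 and a Uniform on [10, 20]) are assumptions chosen only because their inverse CDFs are simple, and the T-Copula case is not covered.</p>

```python
import math
import random
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gaussian_copula(corr, inverse_cdfs, n, rng):
    """Draw n synthetic rows with Gaussian-copula dependence and the given marginals."""
    low = cholesky(corr)
    d = len(corr)
    std_normal = NormalDist()
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # y = L z is multivariate normal with correlation matrix corr.
        y = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        u = [std_normal.cdf(v) for v in y]  # one copula sample in (0, 1)^d
        rows.append([inv(ui) for inv, ui in zip(inverse_cdfs, u)])
    return rows


rng = random.Random(7)
corr = [[1.0, 0.8], [0.8, 1.0]]
inverse_cdfs = [
    lambda u: -math.log(1.0 - u) / 2.0,  # Exponential(rate=2), illustrative
    lambda u: 10.0 + 10.0 * u,           # Uniform(10, 20), illustrative
]
synthetic = sample_gaussian_copula(corr, inverse_cdfs, 1000, rng)
print(synthetic[:3])
```

        <p>Each column follows its own marginal while the dependence between columns is carried entirely by the correlation matrix given to the copula.</p>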
        <p>Once the synthetic data generation process is finished, a
reconstruction operation is performed in order to convert each
categorical variable back to its original representation by replacing
interval values with their corresponding, most likely, categories.
Finally, the row representation of the time series is transformed back
into a column representation.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>EXPERIMENTS</title>
      <p>In this section, we report the experiments that were conducted
to validate MTCopula’s ability to generate synthetic data (the source
code is available at https://github.com/cderunz/MTCopula). In
order to evaluate our approach, we answer the following research
questions:
(1) MTCopula relies on the central hypothesis that Copulas
are pertinent to generate synthetic data. To confirm it, we
propose experiments where state-of-the-art generators
(ITS, GADP, MLE, and CMLE) are compared with different
Gaussian Copulas and the T-Copula. As a Gaussian Copula is
defined by its correlation matrix to model dependency, our
test incorporates several ways to estimate this correlation
matrix: Kendall’s τ, Pearson, and Spearman coefficients. In
conclusion, this experiment validates the choices of both
the Copula and Kendall’s τ.
(2) The main bottleneck of methods based on Copula is (i)
to be able to choose among the marginal models, and
(ii) to choose among the Copula models that may have
different properties to capture the dependency. MTCopula
automates the process by using AIC as a
measure to automatically determine the best model, either
for marginals or Copula. We show to what extent this
choice is efficient in our context.
(3) Finally, to answer the first question raised in this paper,
we show the efficiency of MTCopula to generate
multiple/multivariate time series based on our initial real
industrial use case on media planning and synthetic media
channels data generation.</p>
      <p>For our experiments, we use the 4 datasets presented in Table 1.
The XYZ dataset was generated using a mixture of Beta and
Gaussian distributions with a correlation between Y and Z only,
in order to simulate complex marginal distributions. The Abalone
and Breast Cancer Wisconsin datasets come from the UCI dataset
platform<sup>3</sup>. The AdWanted dataset<sup>4</sup> comes from the Adwanted Group
company and provides a rich and real use case for our approach
based on media channels. For this specific dataset, the input data,
which consists of 27000 instances in 10 dimensions, is first preprocessed
following the methodology presented in Section 6.1 for Copula
model learning. This produces a multivariate continuous dataset
with 1440 instances of 60 dimensions that we use in our tests.</p>
      <p>Table 1. Datasets: XYZ, Abalone, Breast Cancer Wisconsin (Diagnostic), AdWanted.</p>
      <p>
        7.1.1 Copula versus other state-of-the-art generators. We first
evaluate the ability of the Copula framework to generate synthetic
data that better preserve the dependency structure when compared
to the following state-of-the-art approaches: ITS [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], GADP [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
MLE and CMLE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In order to show the Copula framework
efficiency, we couple different marginals by changing the copula
itself: either T-Copula or Gaussian copula. For the Gaussian
Copula, we use different methods to estimate the correlation
matrix: Gaussian Copula with Kendall’s τ (GCK), Gaussian Copula
with Spearman (GCS), and Gaussian Copula with Pearson (GCP).
      </p>
      <p>We evaluate, on our four datasets, the dependence structure
preservation based on the Root Mean Square Error (RMSE)
between the correlation matrices of the original and the
generated datasets. The lower the RMSE, the better the dependency
structure is captured. The final reported errors, presented in
Table 2, are averaged over 50 runs, except for MLE and CMLE
due to their computation time on the three most complex
datasets. We observe clearly that the dependency structure is
better preserved with Copulas than with state-of-the-art approaches.
For instance, on the Breast Cancer Wisconsin dataset, the mean
RMSE of ITS, GADP, MLE, and CMLE is higher than 0.2, whereas
it is lower than 0.1 for every type of Copula.</p>
      <sec id="sec-16-1">
        <title>3 https://archive.ics.uci.edu/ml/datasets.php 4 The AdWanted dataset is not shareable due to privacy issues</title>
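        <p>The dependence-preservation score used in this evaluation can be sketched as the RMSE between two correlation matrices. The code below is an illustration only: the text does not state which correlation coefficient or which matrix entries enter the RMSE, so Pearson correlation and all matrix entries are assumed here, and the function names are hypothetical.</p>

```python
import math
import random


def correlation_matrix(rows):
    """Pearson correlation matrix of a dataset given as a list of rows."""
    n, d = len(rows), len(rows[0])
    cols = [[r[j] for r in rows] for j in range(d)]
    means = [sum(c) / n for c in cols]
    centered = [[v - m for v in c] for c, m in zip(cols, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(d)]
        for i in range(d)
    ]


def correlation_rmse(real_rows, synth_rows):
    """RMSE between the correlation matrices of the real and synthetic data."""
    real, synth = correlation_matrix(real_rows), correlation_matrix(synth_rows)
    d = len(real)
    sq = [(real[i][j] - synth[i][j]) ** 2 for i in range(d) for j in range(d)]
    return math.sqrt(sum(sq) / len(sq))


rng = random.Random(1)


def draw(n):
    """Two correlated Gaussian columns (correlation 0.8)."""
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([z1, 0.8 * z1 + 0.6 * z2])
    return rows


real = draw(1000)
good = draw(1000)                    # same dependence structure
second = [r[1] for r in real]
rng.shuffle(second)                  # shuffling destroys the dependence
bad = [[r[0], s] for r, s in zip(real, second)]
rmse_good = correlation_rmse(real, good)
rmse_bad = correlation_rmse(real, bad)
print(rmse_good, rmse_bad)
```

        <p>A generator that keeps the marginals but ignores the dependence, like the shuffled column here, is penalized by this score even though each column on its own looks realistic.</p>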
        <p>7.1.2 Choice of Dependency Structure Estimation Method. In
order to validate our choice that Kendall’s τ is relevant and
accurate to estimate and preserve the dependency structure, we compare
several methods to estimate the correlation matrix of the
Gaussian Copula: Kendall, Spearman, and Pearson. Noticeably, we limit
our study to Copulas whose dependency structure D is expressed
as a correlation matrix.</p>
        <p>From Table 2, we can observe that the Kendall, Spearman, and
Pearson methods, for which the median RMSE is between 0.01
and 0.2 depending on the dataset, are significantly more accurate
than the ITS, GADP, MLE, and CMLE methods, for which
the mean RMSE ranges respectively between 0.34 and 0.72, 0.16 and 0.28,
0.44 and 0.89, and 0.17 and 0.86. We can also observe
that the Gaussian Copula with Kendall performs slightly better
than the Gaussian Copulas with Pearson and Spearman.</p>
        <p>These results illustrate the robustness and the effectiveness of
Kendall’s method compared with the other methods for correlation
matrix estimation in the specific case of the Gaussian copula. Therefore,
our choice of Kendall’s τ to capture the dependency structure
is validated both experimentally and theoretically, as illustrated
before in Section 3.</p>
        <p>7.1.3 Impact of the marginal fitting on the quality of data
generation. Figure 5 illustrates a box plot for the variation of the
p-value of the two-sample Kolmogorov-Smirnov test, which
determines whether the synthetic attribute values and the real
attribute values are derived from the same distribution. We
notice that for the first 2 variables X and Y, the median p-values
(resp. ≈ 0.65 and ≈ 0.50) are above the threshold α = 0.05, so we
cannot reject the null hypothesis that the synthetic and the real
marginals are derived from the same distribution. Although the
median p-value of Z (≈ 0.09) is also slightly larger than α, Z is
fitted significantly less accurately than the others. This is due to problems
with the marginal fitting of this distribution. As a consequence,
the correlation between Y and Z is impacted, as visible in Figure 6.
Figure 6 shows that, globally, data generation using a Copula
that captures the dependency structure is able to answer our problem,
but the better we fit both marginals and Copula, the more realistic
the generated data are. As a consequence, our problem boils down
to selecting the most effective marginals and Copula models to
generate the most realistic data. That is the goal of our approach
MTCopula, which relies on AIC as described in the next section.</p>
        <p>Figure 6: (a) Real Data; (b) Synthetic Data.</p>
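        <p>The marginal check described above can be sketched with a two-sample Kolmogorov-Smirnov test. The code below is a standard-library illustration, not the code used for Figure 5: the function name ks_two_sample is hypothetical, and the p-value uses the asymptotic Kolmogorov series, which is an approximation that is only reasonable for moderately large samples.</p>

```python
import math
import random


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while n > i and m > j:
        x = min(a[i], b[j])
        # Move both empirical CDFs past every observation equal to x.
        while n > i and x >= a[i]:
            i += 1
        while m > j and x >= b[j]:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * math.sqrt(n * m / (n + m))
    if 0.2 > lam:  # the series below is numerically 1 in this range
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


rng = random.Random(5)
real = [rng.gauss(0, 1) for _ in range(300)]
synthetic_ok = [rng.gauss(0, 1) for _ in range(300)]   # same distribution
synthetic_bad = [rng.gauss(3, 1) for _ in range(300)]  # shifted distribution
d_ok, p_ok = ks_two_sample(real, synthetic_ok)
d_bad, p_bad = ks_two_sample(real, synthetic_bad)
print(d_ok, p_ok, d_bad, p_bad)
```

        <p>A p-value above the chosen threshold (0.05 in the text) means the test does not reject the hypothesis that the real and synthetic marginals come from the same distribution; it does not prove that they do.</p>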
        <p>7.2.1 Choice of the marginals. To evaluate the importance
of AIC in selecting the marginal distribution
that best fits the behavior of each variable, we fit a list of
bounded distributions: Beta distribution, Uniform distribution,
Truncated Exponential, Truncated Gaussian, and Kernel density
estimation, using the MLE method for each variable. For each of
these distributions, we evaluate the AIC using the fitted
parameters. The distribution with the minimum value of AIC is selected
to model the behavior of the variable. Note that we use a list of
bounded distributions in order to avoid generating outliers. In
addition, we incorporate a Kernel density estimation algorithm
to fit more complex distribution shapes. Table 3 reports the
AIC of the marginal distributions fitted to the XYZ
dataset variables.</p>
        <p>From Table 3, we can observe that, for both the X and Y variables,
the Beta distribution has a very small value of AIC (−11718.86 and
−11001.61, respectively). As a consequence, we notice that the
real data distribution (blue color in Figure 7) and the fitted
distribution (orange color in Figure 7) are almost identical (see Figures
7a and 7b). In contrast, for the variable Z, the value of the minimum
AIC is not as small (4435.44) compared to the other variables. As
a result, we observe a significant difference between the fitted
and the real data distribution in Figure 7c. This is because AIC
estimates the relative amount of information lost by a given model:
the less information a model loses, the higher the quality of that
model.</p>
        <p>Figure 7: (a) X fitting using Beta distribution; (b) Y fitting using Beta distribution; (c) Z fitting using KDE distribution.</p>
        <p>7.2.2 Choice of the copula models. In this experiment, we
investigate the impact of the copula model choice on the quality
of data generation, and we demonstrate the importance of AIC
to choose the best copula model. To this end, we fit two copula
models, the Gaussian and the T-Copula, on two different datasets,
XYZ and Abalone. For both models, we use the Kendall method
to estimate the correlation matrix. The degrees of freedom
of the T-Copula are estimated by the CMLE method with the correlation
matrix held fixed. Results are averaged over 10 runs. Figure 8
illustrates the RMSE evaluation of the dependency preservation
using the two copulas.</p>
        <p>Figure 8: (a) RMSE Variation XYZ; (b) RMSE Variation Abalone.</p>
        <p>From Figure 8a, we can observe that, for the XYZ dataset, the
Gaussian Copula performs better than the T-Copula. On the other
hand, as shown in Figure 8b, the T-Copula outperforms the Gaussian
Copula on the Abalone dataset. This is because the XYZ dataset does not
exhibit a tail dependence structure (see Figure 6a). Consequently,
the use of the T-Copula will impact the correlation matrix (see Eq.
10) by considering dependencies in the tails that do not appear
in the original data. Conversely, the Abalone dataset shows a lower tail
dependence structure, as illustrated in Figure 9a. As a result, using
a T-Copula for data generation will correct the dependencies in the
tails, which is not the case with the Gaussian Copula. For the
moment, we use only the T-Copula for tail dependence modeling;
because it has a symmetric tail structure, we
do not control the upper tail structure in the generated synthetic
data, as shown in Figure 9b. Results in Table 4 confirm those
conclusions. For the XYZ dataset, the lowest AIC
corresponds to the Gaussian Copula (3993.73). On the other hand,
the T-Copula has the minimal value of AIC on the Abalone
dataset (9507.26). This confirms the interest of AIC in choosing the
copula model that best fits the data generation process.</p>
        <p>The objective of this experiment is to measure the
effectiveness of MTCopula on a real media dataset as provided by
the Adwanted Group company. According to Table 4, as the AIC for the T-Copula
(127) is lower than the AIC for the Gaussian Copula (202),
MTCopula automatically selects the T-Copula for this
dataset to sample synthetic multivariate time series. These data
will be used in the following experiments to evaluate the
business-related qualities of the generated data. The results, in terms of
RMSE, presented in Table 2, confirm this choice, as the T-Copula
obtains a slightly better performance: ≈ 0.088 with standard
deviation ≈ 0.0005 for the T-Copula and ≈ 0.093 with standard deviation
≈ 0.002 for the Gaussian Copula with Kendall’s τ.</p>
        <p>To study the utility of the generated time series, we compare
each time series in the generated dataset with its counterpart
from the same target user category, the same day of the week,
and the same channel in the real dataset. For each pair, we
measure the MAE of the statistical properties of the time series,
namely the Min, Max, Mean, Median, Standard deviation,
and 95th percentile. Figure 10 shows the MAE of those measures.
From this figure, we can observe an overall variation smaller
than 0.2, which is a very good result as it is significantly smaller
than the observed standard deviation of those statistics in the
original dataset (respectively ≈ 1.66, ≈ 0.54, ≈ 0.46, ≈ 0.44, and
[...] standard deviation ≈ 0.2, this result reflects the ability of MTCopula
to preserve the time series’ characteristics when generating
synthetic data. This overall good business-related performance
supports the utility of the synthetic time series in several
situations when access to the real data is not possible.</p>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>CONCLUSION</title>
      <p>This paper proposed MTCopula, a flexible, extendable, and generic
solution for synthetic complex data generation. It incorporates
different Copula models (for the moment Gaussian and T-Copula)
in order to capture different dependency structures, including tail
dependence. To bypass the non-invariance problem of
Pearson-correlation-based Copula methods, MTCopula relies on Kendall’s
τ, which is robust to outliers and invariant under strictly
monotone transformations. This ensures dependency preservation
during the process of copula learning. Unlike the GADP approach,
which uses only the Gaussian distribution to model the marginals,
our solution incorporates a variety of bounded distributions in
order to best fit the behavior of variables and to avoid generating
outliers. In addition, MTCopula is less restrictive in terms of the
quantity of the input data and is more explainable than GANs.
MTCopula is able to automatically select both the univariate
marginal distributions and the copula model that best fit the input
data. For that, it uses MLE to fit the possible marginal distribution
models, and then AIC to choose both the best distribution and the
best Copula model between the T-Copula and the Gaussian one.
MTCopula handles multiple data types, including complex
tabular datasets and multiple/multivariate time series. The proposed
experiments show MTCopula’s interest and efficiency compared
to existing methods.</p>
      <p>In our future work, first, further experiments will be
conducted to evaluate (i) the sensitivity of MTCopula to the number
of parameters it has to fit to correctly estimate the marginals or
Copula models, by varying the number and the nature of the
variables, and (ii) how it deals with asymmetric tail dependency
behaviors, as this problem is still open in MTCopula. Second, we
will work on making our approach robust to missing values in
the original datasets. Third, we plan to study the use of synthetic
data for machine learning model fitting, in order to assess the
quality of the new data for different tasks. Fourth, an important
way to see how useful MTCopula could be for
machine learning tasks is also to analyze its scalability with respect
to the number of original and generated data. Fifth, we want to
tackle a new research problem: how can MTCopula efficiently
consider conditional dependencies between variables? Using Vine
Copulas seems to be a promising solution that we need to study.</p>
    </sec>
    <sec id="sec-18">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is funded by the ANRT CIFRE Program (2019/0877).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ruzanna</given-names>
            <surname>Ab Razak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Noriszura</given-names>
            <surname>Ismail</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Dependence Modeling and Portfolio Risk Estimation using GARCH-Copula Approach</article-title>
          .
          <source>Sains Malaysiana</source>
          <volume>48</volume>
          ,
          <issue>7</issue>
          (
          <year>2019</year>
          ),
          <fpage>1547</fpage>
          -
          <lpage>1555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hirotugu</given-names>
            <surname>Akaike</surname>
          </string-name>
          .
          <year>1974</year>
          .
          <article-title>A new look at the statistical model identification</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          <volume>19</volume>
          ,
          <issue>6</issue>
          (
          <year>1974</year>
          ),
          <fpage>716</fpage>
          -
          <lpage>723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Claudia</given-names>
            <surname>Czado</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Analyzing Dependent Data with Vine Copulas</article-title>
          . Lecture Notes in Statistics, Springer (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Demarta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander J</given-names>
            <surname>McNeil</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The t copula and related copulas</article-title>
          .
          <source>International Statistical Review</source>
          <volume>73</volume>
          ,
          <issue>1</issue>
          (
          <year>2005</year>
          ),
          <fpage>111</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Genest</surname>
          </string-name>
          , Kilani Ghoudi, and
          <string-name>
            <given-names>L-P</given-names>
            <surname>Rivest</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>A semiparametric estimation procedure of dependence parameters in multivariate families of distributions</article-title>
          .
          <source>Biometrika</source>
          <volume>82</volume>
          ,
          <issue>3</issue>
          (
          <year>1995</year>
          ),
          <fpage>543</fpage>
          -
          <lpage>552</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Harry</given-names>
            <surname>Joe</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Asymptotic efficiency of the two-stage estimation method for copula-based models</article-title>
          .
          <source>Journal of multivariate Analysis</source>
          <volume>94</volume>
          ,
          <issue>2</issue>
          (
          <year>2005</year>
          ),
          <fpage>401</fpage>
          -
          <lpage>419</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Harry</given-names>
            <surname>Joe</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dependence modeling with copulas</article-title>
          . CRC press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Kotz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Saralees</given-names>
            <surname>Nadarajah</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Extreme value distributions: theory and applications</article-title>
          . World Scientific.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Dorota</given-names>
            <surname>Kurowicka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roger M</given-names>
            <surname>Cooke</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Uncertainty analysis with high dimensional dependence modelling</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jialin</given-names>
            <surname>Fu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources</article-title>
          . arXiv preprint arXiv:2009.09471 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Donald</given-names>
            <surname>MacKenzie</surname>
          </string-name>
          and
          <string-name>
            <given-names>Taylor</given-names>
            <surname>Spears</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>'The formula that killed Wall Street': The Gaussian copula and modelling practices in investment banking</article-title>
          .
          <source>Social Studies of Science</source>
          <volume>44</volume>
          ,
          <issue>3</issue>
          (
          <year>2014</year>
          ),
          <fpage>393</fpage>
          -
          <lpage>417</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Alexander J</given-names>
            <surname>McNeil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rüdiger</given-names>
            <surname>Frey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Embrechts</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Quantitative risk management: concepts, techniques and tools-revised edition</article-title>
          . Princeton university press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Mikosch</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Copulas: Tales and facts</article-title>
          .
          <source>Extremes</source>
          <volume>9</volume>
          ,
          <issue>1</issue>
          (
          <year>2006</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Krishnamurty</given-names>
            <surname>Muralidhar</surname>
          </string-name>
          , Rahul Parsa, and
          <string-name>
            <given-names>Rathindra</given-names>
            <surname>Sarathy</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>A general additive data perturbation method for database security</article-title>
          .
          <source>Management Science</source>
          <volume>45</volume>
          ,
          <issue>10</issue>
          (
          <year>1999</year>
          ),
          <fpage>1399</fpage>
          -
          <lpage>1415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Roger B</given-names>
            <surname>Nelsen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>An introduction to copulas</article-title>
          . Springer Science &amp; Business Media.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Aristidis K</given-names>
            <surname>Nikoloulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Harry</given-names>
            <surname>Joe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Haijun</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Extreme value properties of multivariate t copulas</article-title>
          .
          <source>Extremes</source>
          <volume>12</volume>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>129</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Sheehan</given-names>
            <surname>Olver</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alex</given-names>
            <surname>Townsend</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Fast inverse transform sampling in one and two dimensions</article-title>
          .
          <source>arXiv preprint arXiv:1307.1223</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Noseong</given-names>
            <surname>Park</surname>
          </string-name>
          , Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and
          <string-name>
            <given-names>Youngmin</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Data Synthesis Based on Generative Adversarial Networks</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          ,
          <issue>10</issue>
          (
          <year>2018</year>
          ),
          <fpage>1071</fpage>
          -
          <lpage>1083</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Neha</given-names>
            <surname>Patki</surname>
          </string-name>
          , Roy Wedge, and
          <string-name>
            <given-names>Kalyan</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The synthetic data vault</article-title>
          .
          <source>In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)</source>
          . IEEE,
          <fpage>399</fpage>
          -
          <lpage>410</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Patton</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Copula methods for forecasting multivariate time series</article-title>
          .
          <source>In Handbook of economic forecasting</source>
          . Vol.
          <volume>2</volume>
          . Elsevier,
          <fpage>899</fpage>
          -
          <lpage>960</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Petricioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Humski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vranić</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pintar</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Data Set Synthesis Based on Known Correlations and Distributions for Expanded Social Graph Generation</article-title>
          .
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ),
          <fpage>33013</fpage>
          -
          <lpage>33022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Stéphanie</given-names>
            <surname>Portet</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A primer on model selection using the Akaike information criterion</article-title>
          .
          <source>Infectious Disease Modelling</source>
          <volume>5</volume>
          (
          <year>2020</year>
          ),
          <fpage>111</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Alexander J</given-names>
            <surname>Ratner</surname>
          </string-name>
          , Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Ré</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Learning to compose domain-specific transformations for data augmentation</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>3236</fpage>
          -
          <lpage>3246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Jerome P</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Quanli</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Biyuan</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Bayesian estimation of disclosure risks for multiply imputed, synthetic data</article-title>
          .
          <source>Journal of Privacy and Confidentiality</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Marko</given-names>
            <surname>Robnik-Šikonja</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Data generators for learning systems based on RBF networks</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>27</volume>
          ,
          <issue>5</issue>
          (
          <year>2015</year>
          ),
          <fpage>926</fpage>
          -
          <lpage>938</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>David</given-names>
            <surname>Salinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bohlke-Schneider</surname>
          </string-name>
          , Laurent Callot, Roberto Medico, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Gasthaus</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>6827</fpage>
          -
          <lpage>6837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Serinaldi</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Analysis of inter-gauge dependence by Kendall's τK, upper tail dependence coefficient, and 2-copulas with application to rainfall fields</article-title>
          .
          <source>Stochastic Environmental Research and Risk Assessment</source>
          <volume>22</volume>
          ,
          <issue>6</issue>
          (
          <year>2008</year>
          ),
          <fpage>671</fpage>
          -
          <lpage>688</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M</given-names>
            <surname>Sklar</surname>
          </string-name>
          .
          <year>1959</year>
          .
          <article-title>Fonctions de répartition à n dimensions et leurs marges</article-title>
          .
          <source>Publ. Inst. Statist. Univ. Paris</source>
          <volume>8</volume>
          (
          <year>1959</year>
          ),
          <fpage>229</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Natasa</given-names>
            <surname>Tagasovska</surname>
          </string-name>
          , Damien Ackerer, and
          <string-name>
            <given-names>Thibault</given-names>
            <surname>Vatter</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Copulas as High-Dimensional Generative Models: Vine Copula Autoencoders</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>6528</fpage>
          -
          <lpage>6540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Honggang</given-names>
            <surname>Zhu</surname>
          </string-name>
          , LM Zhang, Te Xiao, and
          <string-name>
            <given-names>XY</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Generation of multivariate cross-correlated geotechnical random fields</article-title>
          .
          <source>Computers and Geotechnics</source>
          <volume>86</volume>
          (
          <year>2017</year>
          ),
          <fpage>95</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>