<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Dataset Generation for Non-Trivial Regression Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Flavio Giobergia</string-name>
          <email>flavio.giobergia@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Savelli</string-name>
          <email>claudio.savelli@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We propose a novel method for generating synthetic regression datasets aimed at educational and evaluative settings. Unlike standard synthetic data generation approaches, which sample inputs from a predefined distribution and compute targets via a fixed function, our method optimizes the input data directly. Given a fixed target vector and a randomly initialized, frozen nonlinear model, we perform gradient-based optimization over the input features to match the targets. To avoid trivial solutions, we introduce an additional loss term that explicitly penalizes the performance of a naive baseline model, such as linear regression. The resulting datasets are guaranteed to exhibit nonlinear structure while remaining controllable, reproducible, and interpretable. We further show how to project the optimized continuous inputs into mixed-type feature spaces, including numerical, ordinal, and categorical variables. Experimental results demonstrate that the proposed approach produces datasets that are solvable by nonlinear models but systematically challenging for linear ones, making them particularly suitable for educational purposes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Synthetic datasets are widely used in the teaching of regression and machine learning, particularly in
examinations and practical assignments. Instructors often require small, previously unseen datasets
that can be solved within limited time constraints while still exhibiting nontrivial structure. However,
commonly adopted generation strategies – such as sampling from linear or mildly nonlinear functions
with noise – tend to yield problems that are either trivial for basic models or insufficiently challenging.</p>
      <p>
        More advanced synthetic data generation techniques, including generative adversarial networks
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] and variational autoencoders [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], aim to reproduce the statistical properties of real-world
data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        While synthetic data generation has gained increasing attention for its applications in
benchmarking [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], privacy preservation [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], and data augmentation [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], these methods
are poorly suited for educational use: they offer limited interpretability, provide little control over the
relative performance of different model classes, and often produce datasets whose structure is difficult
for students to reason about.
      </p>
      <p>In this work, we adopt a complementary perspective. Rather than fixing the input features and
generating targets via a predefined mapping, we fix both the target values and the underlying nonlinear
process and optimize the input features so that the desired targets are obtained when passed through the
process. Input values are initialized randomly and iteratively refined via gradient-based optimization.
To avoid trivial solutions, we introduce an auxiliary loss term that explicitly penalizes the performance
of a naive baseline model, such as linear regression.</p>
      <p>Finally, we transform the resulting numerical dataset into one that includes ordinal and categorical
attributes. This step serves a dual purpose: it introduces additional structured information loss, leading
to unexplained variance that does not stem from arbitrary noise injection, and it increases the realism
and complexity of the resulting tabular data.</p>
      <p>The proposed procedure produces regression datasets that:
1. exhibit a clear but nonlinear relationship between inputs and targets;
2. are intentionally challenging for naive baseline models;
3. include mixed feature types commonly encountered in applied settings;
4. are reproducible and controllable in terms of size and difficulty.</p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2026 Joint Conference (March 24-27, 2026), Tampere, Finland. CEUR Workshop Proceedings, ISSN 1613-0073.</p>
      <p>Overall, we propose a differentiable dataset-generation framework that directly optimizes input
features under pedagogically motivated constraints, yielding learnable yet nontrivial datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
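      <p>The objective of the proposed methodology is to generate a dataset
<inline-formula><tex-math>\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}</tex-math></inline-formula>,
where <inline-formula><tex-math>n</tex-math></inline-formula> is a user-defined dataset size, <inline-formula><tex-math>\mathbf{x}_i \in \mathbb{R}^{d}</tex-math></inline-formula> denotes a vector of independent variables, and <inline-formula><tex-math>y_i \in \mathbb{R}</tex-math></inline-formula> is a
continuous target value. While a variety of synthetic data generation strategies exist, we impose a set
of constraints motivated by pedagogical considerations:
1. Existence of an underlying data-generating process. We assume the existence of a function <inline-formula><tex-math>f(\cdot)</tex-math></inline-formula> such
that <inline-formula><tex-math>y_i = f(\mathbf{x}_i)</tex-math></inline-formula>. This ensures that the regression task is well-defined and, in principle, solvable.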
2. Non-triviality of the generating process. Simple linear or weakly nonlinear relationships often
yield datasets that are trivially solved by basic models. We instead require a nonlinear process
that cannot be adequately approximated by naive baselines.
3. Presence of unexplained variance. In real-world datasets, perfect prediction is rarely achievable
due to latent or unobserved variables. Rather than injecting artificial noise, we aim to induce
unexplained variance by removing information, thereby reflecting missing or inaccessible features.
4. Mixed-type tabular structure. The generated datasets should include numerical, ordinal, and
categorical attributes, reflecting the structure of many applied tabular learning problems.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Generation of Non-Trivial Datasets</title>
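      <p>To address the first three desiderata, we define a nonlinear generation process
<inline-formula><tex-math>f_\theta</tex-math></inline-formula>, whose parameters are
randomly initialized and kept fixed throughout the procedure. We also fix a vector of target values
<inline-formula><tex-math>\mathbf{y} = \{y_1, \dots, y_n\}</tex-math></inline-formula>,
where each <inline-formula><tex-math>y_i</tex-math></inline-formula> is sampled from a user-specified distribution (e.g., Gaussian or uniform).</p>
      <p>Input features <inline-formula><tex-math>\mathbf{x}_i \in \mathbb{R}^{d}</tex-math></inline-formula> are initialized randomly and treated as learnable variables. For convenience,
we denote by <inline-formula><tex-math>\mathbf{X}</tex-math></inline-formula> the matrix stacking all <inline-formula><tex-math>\mathbf{x}_i</tex-math></inline-formula> row-wise. These learnable variables are optimized via gradient
descent to minimize the discrepancy between <inline-formula><tex-math>f_\theta(\mathbf{x}_i)</tex-math></inline-formula> and <inline-formula><tex-math>y_i</tex-math></inline-formula>, while keeping both <inline-formula><tex-math>f_\theta</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{y}</tex-math></inline-formula> fixed. The
primary loss function is defined as:
        <disp-formula><label>(1)</label><tex-math>\mathcal{L}_{\mathrm{gen}} = \frac{1}{n} \left\| f_\theta(\mathbf{X}) - \mathbf{y} \right\|^2 .</tex-math></disp-formula>
      </p>
      <p>Optimizing this loss alone may lead to degenerate solutions in which simple linear models achieve
competitive performance. To counteract this effect, we introduce an auxiliary loss term that penalizes
the performance of a naive baseline regressor.
Let <inline-formula><tex-math>\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}</tex-math></inline-formula> denote the closed-form solution of a linear regression model trained on <inline-formula><tex-math>\mathbf{X}</tex-math></inline-formula> and
<inline-formula><tex-math>\mathbf{y}</tex-math></inline-formula>. The corresponding mean squared error is:
        <disp-formula><label>(2)</label><tex-math>\mathcal{L}_{\mathrm{lin}} = \frac{1}{n} \left\| \mathbf{X}\mathbf{w} - \mathbf{y} \right\|^2 .</tex-math></disp-formula>
      </p>
      <p>The final optimization objective is:
        <disp-formula><label>(3)</label><tex-math>\mathcal{L} = \mathcal{L}_{\mathrm{gen}} - \lambda \mathcal{L}_{\mathrm{lin}},</tex-math></disp-formula>
where <inline-formula><tex-math>\lambda \ge 0</tex-math></inline-formula> controls the extent to which linear solutions are penalized. Larger values of <inline-formula><tex-math>\lambda</tex-math></inline-formula> yield datasets
that are increasingly difficult for naive models, while remaining perfectly learnable by the underlying
nonlinear process.</p>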
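      <p>As a minimal sketch of the optimization in Equations (1) to (3), the NumPy snippet below updates the inputs X by plain gradient descent while a small network stays frozen. Several choices are assumptions made for illustration and are not taken from the paper: the network has a single hidden ReLU layer rather than the deeper network used in the experiments, the sizes n, d, h and the step size lr are arbitrary, and the gradients are written out by hand. The gradient of the linear term relies on the least-squares residual being orthogonal to the columns of X, so only the direct dependence on X remains.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 8, 32          # samples, input features, hidden units (illustrative sizes)
lam = 0.15                    # weight of the linear-baseline penalty (lambda in Eq. 3)

# Frozen, randomly initialized one-hidden-layer ReLU network standing in for f_theta.
W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, h))
b1 = rng.normal(0.0, 0.1, size=h)
w2 = rng.normal(0.0, 1.0 / np.sqrt(h), size=h)

y = rng.normal(size=n)        # fixed targets
X = rng.normal(size=(n, d))   # learnable inputs


def losses_and_grad(X):
    """Return (L_gen, L_lin, dL/dX) for L = L_gen - lam * L_lin."""
    Z = X @ W1 + b1
    H = np.maximum(Z, 0.0)
    r_gen = H @ w2 - y                                   # residual of the frozen network
    L_gen = np.mean(r_gen ** 2)
    g_gen = ((2.0 / n) * np.outer(r_gen, w2) * (Z > 0)) @ W1.T

    w, *_ = np.linalg.lstsq(X, y, rcond=None)            # least-squares linear fit (no intercept)
    r_lin = X @ w - y
    L_lin = np.mean(r_lin ** 2)
    # At the least-squares optimum X^T r_lin = 0, so only the direct term remains.
    g_lin = (2.0 / n) * np.outer(r_lin, w)
    return L_gen, L_lin, g_gen - lam * g_lin


gen_before, lin_before, _ = losses_and_grad(X)
lr = 0.05 * n                 # the 1/n of the mean is folded into the step size
for _ in range(500):
    _, _, grad = losses_and_grad(X)
    X -= lr * grad            # only the inputs move; W1, b1, w2 and y stay fixed
gen_after, lin_after, _ = losses_and_grad(X)
print(f"L_gen: {gen_before:.3f} -> {gen_after:.3f}; L_lin: {lin_before:.3f} -> {lin_after:.3f}")
```
]]></preformat>
      <p>An automatic-differentiation framework would remove the need for the hand-written gradients; the manual version is used here only to keep the sketch dependency-free.</p>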
      <p>While we focus on linear regression as the baseline model – given its canonical role in introductory
regression analysis – the same framework naturally extends to polynomial or other parametric baselines
via appropriate feature transformations.</p>
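      <p>The extension to other parametric baselines can be sketched by applying a feature map to X before the closed-form fit. In the snippet below, baseline_mse, linear_features and quadratic_features are hypothetical helper names introduced for illustration, and the bias column is an added assumption (the closed form given in the text has none). On a purely quadratic toy target, the linear baseline leaves a large residual while the quadratic one fits almost exactly.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def baseline_mse(Phi, y):
    """Mean squared error of the least-squares fit of y on the columns of Phi."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ w - y) ** 2))


def linear_features(X):
    """Bias column followed by the raw features."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def quadratic_features(X):
    """Bias, raw features and their squares (cross terms omitted for brevity)."""
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 - X[:, 1] ** 2      # purely quadratic toy target

mse_linear = baseline_mse(linear_features(X), y)
mse_quadratic = baseline_mse(quadratic_features(X), y)
print(f"linear baseline MSE: {mse_linear:.3f}, quadratic baseline MSE: {mse_quadratic:.2e}")
```
]]></preformat>
      <p>Penalizing the quadratic baseline instead of the linear one would amount to using the second residual in place of the first in the auxiliary loss.</p>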
      <sec id="sec-3-1">
        <title>3.1. Latent Variables via Feature Removal</title>
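        <p>The optimization procedure produces a dataset <inline-formula><tex-math>\mathbf{X}</tex-math></inline-formula> that, by construction, allows <inline-formula><tex-math>f_\theta</tex-math></inline-formula> to accurately recover
the target values <inline-formula><tex-math>\mathbf{y}</tex-math></inline-formula>. To model the presence of latent or unobserved variables, we remove a subset of
<inline-formula><tex-math>d' \le d</tex-math></inline-formula> features from <inline-formula><tex-math>\mathbf{X}</tex-math></inline-formula>. These features are used internally by <inline-formula><tex-math>f_\theta</tex-math></inline-formula> but are not included in the final dataset,
resulting in partial information loss.</p>
        <p>We define the latent information ratio as <inline-formula><tex-math>\rho = d'/d</tex-math></inline-formula>. The observed dataset <inline-formula><tex-math>\mathbf{X}'</tex-math></inline-formula> is obtained by applying a
projection that removes the latent dimensions. This mechanism induces unexplained variance without
injecting stochastic noise, reflecting missing information rather than randomness.</p>
        <p>Through these steps, we satisfy the first three desiderata:
1. the existence of an underlying data-generating process;
2. the non-triviality of the input–output relationship;
3. the presence of unexplained variance due to latent variables.</p>
      </sec>
      <sec>
        <title>3.2. Conversion to Ordinal and Categorical Attributes</title>
        <p>To satisfy the final desideratum, we transform the numerical dataset <inline-formula><tex-math>\mathbf{X}'</tex-math></inline-formula> into a mixed-type tabular
representation. Let <inline-formula><tex-math>d_{\mathrm{num}}</tex-math></inline-formula>, <inline-formula><tex-math>d_{\mathrm{ord}}</tex-math></inline-formula>, and <inline-formula><tex-math>d_{\mathrm{cat}}</tex-math></inline-formula> denote the desired number of numerical, ordinal, and categorical
attributes, respectively.</p>
        <p>We first select <inline-formula><tex-math>d_{\mathrm{num}}</tex-math></inline-formula> columns of <inline-formula><tex-math>\mathbf{X}'</tex-math></inline-formula> that are left unchanged, yielding the numerical attributes. Next,
<inline-formula><tex-math>d_{\mathrm{ord}}</tex-math></inline-formula> columns are discretized using standard techniques such as equal-width binning, equal-frequency
binning, or <inline-formula><tex-math>k</tex-math></inline-formula>-means discretization. Each discretized attribute may use a different number of bins,
allowing for varying levels of granularity. Since the bins correspond to ordered intervals of a continuous
variable, the resulting attributes are naturally ordinal.</p>
        <p>Categorical attributes are generated by applying an arg max operation to disjoint subsets of columns.
To encode a categorical attribute <inline-formula><tex-math>j</tex-math></inline-formula> with <inline-formula><tex-math>k_j</tex-math></inline-formula> possible values, we allocate <inline-formula><tex-math>k_j</tex-math></inline-formula> columns and assign each
instance to the index of the column with the maximum value. This operation can be interpreted as the
inverse of one-hot encoding.</p>
        <p>To generate all <inline-formula><tex-math>d_{\mathrm{cat}}</tex-math></inline-formula> categorical attributes, a total of <inline-formula><tex-math>\sum_{j=1}^{d_{\mathrm{cat}}} k_j</tex-math></inline-formula> columns is required. The total dimensionality
of the initial feature space is therefore:
          <disp-formula><label>(4)</label><tex-math>d = d' + d_{\mathrm{num}} + d_{\mathrm{ord}} + \sum_{j=1}^{d_{\mathrm{cat}}} k_j .</tex-math></disp-formula>
        </p>
        <p>Finally, we note that the conversion to ordinal and categorical attributes introduces additional,
structured information loss that further contributes to unexplained variance. While the primary driver
of latent information is the ratio <inline-formula><tex-math>\rho</tex-math></inline-formula>, these lossy transformations play a secondary role. Accurately
modeling their impact is nontrivial; in practice, we recommend an empirical approach validated against
baseline model performance.</p>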
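        <p>The feature-removal and conversion steps can be sketched with NumPy as follows. The function to_mixed_types is a hypothetical helper, not the authors' implementation: it assumes the columns of the optimized matrix are laid out as latent, numerical, ordinal, and categorical blocks in that order, and it uses equal-frequency binning with the same number of bins for every ordinal attribute. The internal check mirrors the dimensionality count of Equation (4).</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def to_mixed_types(X, d_latent, d_num, d_ord, cat_sizes, n_bins=5):
    """Split optimized inputs into latent (dropped), numerical, ordinal and categorical parts."""
    n, d = X.shape
    # Dimensionality check: latent + numerical + ordinal + one column per category value.
    assert d == d_latent + d_num + d_ord + sum(cat_sizes), "column counts must add up to d"

    start = d_latent                                   # the first d_latent columns are discarded
    numerical = X[:, start:start + d_num]              # kept unchanged
    start += d_num

    ordinal = np.empty((n, d_ord), dtype=int)
    for j in range(d_ord):
        col = X[:, start + j]
        # Equal-frequency binning: the interior quantiles act as bin edges.
        edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
        ordinal[:, j] = np.searchsorted(edges, col, side="right")
    start += d_ord

    categorical = np.empty((n, len(cat_sizes)), dtype=int)
    for j, k in enumerate(cat_sizes):
        # Arg max over k dedicated columns, i.e. the inverse of one-hot encoding.
        categorical[:, j] = np.argmax(X[:, start:start + k], axis=1)
        start += k
    return numerical, ordinal, categorical


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2 + 3 + 2 + (3 + 4)))      # stand-in for the optimized inputs
num, ordv, cat = to_mixed_types(X, d_latent=2, d_num=3, d_ord=2, cat_sizes=[3, 4])
print(num.shape, ordv.shape, cat.shape)
```
]]></preformat>
        <p>In this toy call, the latent information ratio is 2/14, and the two categorical attributes take 3 and 4 possible values, respectively.</p>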
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Additional post-processing</title>
        <p>The obtained dataset, after the reported steps, is characterized by numerical, ordinal and categorical
features, and a target that depends on these features.</p>
        <p>It is possible to introduce additional characteristics to the dataset to make it more useful from a
pedagogical perspective. Since the focus of the proposed work is to produce a “base” dataset, the
post-processing steps are left as an additional activity to be carried out as needed. Some examples are
reported below.</p>
        <p>Feature dependence. The assumption of independence of the variables is a simplifying step that
makes the learning objective more straightforward. Adding dependence among the variables is possible
either by introducing additional objectives (as part of Equation 3) or by introducing a post-processing
step that generates combinations of existing variables.</p>
        <p>Noise and outliers. Noise can be added to the existing data points (e.g., at the row or column level)
to weaken their correlation with the target variable. In addition, outlier points can be randomly generated
and introduced. The fraction of noise and outliers introduced will, of course, affect the performance
that can be obtained on the task.</p>
        <p>Missing values. Similarly to noise, missing values can be introduced to enrich the pre-processing
that needs to be applied to the dataset. They can be introduced (1) at the row or column level, (2)
uniformly throughout the dataset, or (3) in a way that is correlated with the target variable.</p>
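        <p>A few of these post-processing options can be sketched with NumPy. The helpers below (add_gaussian_noise, mask_uniformly, mask_by_target) are hypothetical names chosen for illustration, and the specific choices, such as Gaussian noise and masking only rows whose target exceeds the median, are assumptions rather than the procedure used by the authors.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def add_gaussian_noise(X, columns, scale, rng):
    """Return a copy of X with zero-mean Gaussian noise added to the given columns."""
    X_noisy = X.astype(float).copy()
    X_noisy[:, columns] += rng.normal(0.0, scale, size=(X.shape[0], len(columns)))
    return X_noisy


def mask_uniformly(X, fraction, rng):
    """Return a copy of X where each cell becomes NaN independently with probability `fraction`."""
    X_missing = X.astype(float).copy()
    X_missing[rng.random(X.shape) < fraction] = np.nan
    return X_missing


def mask_by_target(X, y, column, fraction, rng):
    """Set `column` to NaN for a fraction of the rows whose target lies above the median."""
    X_missing = X.astype(float).copy()
    high = np.flatnonzero(y > np.median(y))
    chosen = rng.choice(high, size=int(fraction * high.size), replace=False)
    X_missing[chosen, column] = np.nan
    return X_missing


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1]

X_noisy = add_gaussian_noise(X, columns=[0, 2], scale=0.3, rng=rng)
X_uniform = mask_uniformly(X, fraction=0.1, rng=rng)
X_target = mask_by_target(X, y, column=1, fraction=0.2, rng=rng)
print(np.isnan(X_uniform).mean(), np.isnan(X_target[:, 1]).sum())
```
]]></preformat>
        <p>Each helper returns a modified copy, so several variants of the same base dataset can be derived without regenerating it.</p>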
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, we present the experimental results obtained by generating datasets with the proposed
method. First, in Section 4.1, we present the performance of various regression models on datasets
generated using the proposed method and compare it with other baselines. Next, in Section 4.2, we show
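how the hyperparameters <inline-formula><tex-math>\lambda</tex-math></inline-formula> and <inline-formula><tex-math>\rho</tex-math></inline-formula> affect the obtained results. We present qualitative results, including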
low-dimensional ones, in Section 4.3. Finally, in Section 5, we present results from a real-world use case
in which we employed a generated dataset as part of an exam, involving 143 participants.</p>
      <sec id="sec-4-1">
        <title>4.1. Main results</title>
        <p>
          We compare the datasets obtained with the proposed method with those generated using the commonly
adopted scikit-learn [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] library via the make_regression function. This method generates random
attributes by sampling them from Gaussian distributions, producing an output as a linear combination
of the independent variables with additional Gaussian noise (with standard deviation 1.0). We include
make_regression not as a strong baseline, but as a commonly used educational tool. For the proposed
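approach, we set <inline-formula><tex-math>\lambda = 0.15</tex-math></inline-formula> and <inline-formula><tex-math>\rho = 0</tex-math></inline-formula>. For <inline-formula><tex-math>f_\theta</tex-math></inline-formula>, we use a neural network with 5 linear layers interleaved
with ReLU functions. More details on the actual implementation are available in the GitHub repository<sup>1</sup>.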
        </p>
        <p>For both methods, we generate <inline-formula><tex-math>n = 10{,}000</tex-math></inline-formula> samples and <inline-formula><tex-math>d = 100</tex-math></inline-formula> attributes. For a fair comparison,
we keep all the features generated with our methodology numerical. We generate 10 separate datasets
with both techniques and test them against various commonly adopted regression methods: Decision
Trees, Random Forests, K-Nearest Neighbors, and Linear Regression (also with two regularization
techniques: Ridge and Lasso). In all cases, we use a 70/30 train/test split to ensure a fair evaluation and
evaluate the results using the <inline-formula><tex-math>R^2</tex-math></inline-formula> score.</p>
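        <p>The evaluation protocol for the make_regression baseline can be sketched with scikit-learn as shown below. The dataset is deliberately smaller than the one described in the text (2,000 samples and 20 features, an assumption made to keep the sketch fast), all models use default hyperparameters except for the seeds and a reduced number of trees, and a single dataset is generated instead of ten.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Linear targets with Gaussian noise of standard deviation 1.0, as in the baseline.
X, y = make_regression(n_samples=2000, n_features=20, noise=1.0, random_state=0)

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsRegressor(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))   # R^2 on the held-out split

for name, score in scores.items():
    print(f"{name:>20s}: R^2 = {score:.3f}")
```
]]></preformat>
        <p>The same loop can be applied unchanged to a dataset produced by the proposed method by replacing X and y.</p>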
        <p>Table 1 shows the results. We emphasize that, unlike common situations where “higher is better”, in
this case we are not interested in achieving high performance, but rather in producing a dataset that is
interesting from a pedagogical perspective.</p>
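        <p>With this in mind, the results are notable: first, as expected, linear regression achieves the best
performance on the scikit-learn datasets, with <inline-formula><tex-math>R^2 \approx 1</tex-math></inline-formula>. This occurs because their data-generation
process is, by design, linear. By contrast, the proposed generation approach, by design,
renders linear regression methods ineffective (<inline-formula><tex-math>R^2 \approx 0</tex-math></inline-formula>).</p>
        <p><sup>1</sup> <ext-link ext-link-type="uri" xlink:href="https://github.com/fgiobergia/synth-datagen">https://github.com/fgiobergia/synth-datagen</ext-link></p>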
        <p>Other regressors also work well on datasets generated with the proposed method. An additional insight emerges from the
behavior of K-Nearest Neighbors regression. KNN relies on the assumption that nearby points in feature
space correspond to similar target values and is therefore sensitive to the presence of meaningful local
structure. On datasets generated with make_regression, KNN performs poorly, achieving a negative
<inline-formula><tex-math>R^2</tex-math></inline-formula>, indicating that local neighborhoods in the high-dimensional feature space are largely uninformative.
In contrast, KNN achieves strong performance on datasets generated with the proposed method. This
suggests that the optimization-based generation procedure induces a feature space in which locality is
more informative, despite the absence of an explicit locality constraint in the objective.</p>
      </sec>
      <sec>
        <title>4.2. Influence of <inline-formula><tex-math>\lambda</tex-math></inline-formula> and <inline-formula><tex-math>\rho</tex-math></inline-formula></title>
        <p>In this section, we discuss the effect of the two main hyperparameters adopted, <inline-formula><tex-math>\lambda</tex-math></inline-formula> and <inline-formula><tex-math>\rho</tex-math></inline-formula>.</p>
        <p>Figure 1 shows the performance of various regressors, as well as a linear regressor, as <inline-formula><tex-math>\lambda</tex-math></inline-formula> increases.
As a reminder, <inline-formula><tex-math>\lambda</tex-math></inline-formula> serves as a parameter that governs how “difficult” the task is to solve with a linear
model. We show how <inline-formula><tex-math>\lambda = 0</tex-math></inline-formula> generates a very simple task that can be perfectly solved by all models.
Instead, for <inline-formula><tex-math>\lambda &gt; 0</tex-math></inline-formula>, the performance of the linear model is actively penalized during the training, as
shown by the sharp drop in performance. Decision trees, which are known to be very simple models,
are also affected by this additional complexity. Other, more robust, models are instead unaffected. This
behavior can enable educators to produce datasets that work well for a given model but not for others,
thereby encouraging students to explore multiple possible models for their solution.</p>
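        <p>Figure 2 instead shows the performance of the two approaches as <inline-formula><tex-math>\rho</tex-math></inline-formula> varies. We apply the same
approach (i.e., incrementally remove features) to make_regression as well. The result is interesting:
decision trees and KNN both have a drop in performance for large values of <inline-formula><tex-math>\rho</tex-math></inline-formula>. With the exception of
very large values of <inline-formula><tex-math>\rho</tex-math></inline-formula>, however, the datasets still remain “useful” (i.e., <inline-formula><tex-math>R^2 &gt; 0</tex-math></inline-formula>). Instead, for the models
tested on the scikit-learn dataset, performance drops sharply for small values of <inline-formula><tex-math>\rho</tex-math></inline-formula>, rendering the models
unusable (<inline-formula><tex-math>R^2 &lt; 0</tex-math></inline-formula>) when partial information is removed.</p>
        <p>[Figure 2. Vertical axis: <inline-formula><tex-math>R^2</tex-math></inline-formula> score. Legend: generation method (our method, make_regression) and model (Decision Tree, K-Nearest Neighbors).]</p>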
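        <p>The feature-removal sweep applied to make_regression can be sketched as follows. The helper r2_with_latent_ratio is a hypothetical name; it assumes that the first round(rho * d) columns are the ones removed and that a decision tree with default settings is used, and it works on a smaller dataset than the one in the experiments. No particular shape of the resulting curve is implied by this sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def r2_with_latent_ratio(X, y, rho, seed=0):
    """Hide a fraction rho of the columns, then score a decision tree on a 70/30 split."""
    d = X.shape[1]
    n_latent = int(round(rho * d))
    if n_latent >= d:
        raise ValueError("rho must leave at least one observed feature")
    X_obs = X[:, n_latent:]                       # observed features only
    X_tr, X_te, y_tr, y_te = train_test_split(X_obs, y, test_size=0.3, random_state=seed)
    model = DecisionTreeRegressor(random_state=seed).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


X, y = make_regression(n_samples=2000, n_features=20, noise=1.0, random_state=0)
ratios = [0.0, 0.25, 0.5, 0.75]
curve = {rho: r2_with_latent_ratio(X, y, rho) for rho in ratios}
for rho, score in curve.items():
    print(f"rho = {rho:.2f}: R^2 = {score:.3f}")
```
]]></preformat>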
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Qualitative results</title>
        <p>In Figure 3, we qualitatively compare three datasets generated with the proposed method and three
generated using make_regression. All datasets consist of 10,000 samples and 100 features, which
are projected to two dimensions using Principal Component Analysis. The PCA projections of the
make_regression datasets consistently form compact, isotropic clusters, a behavior attributable to
the Gaussian distribution of the input features. This pattern is stable across runs, indicating limited
structural variability in the generated data.</p>
        <p>In contrast, datasets produced by the proposed methodology exhibit markedly different geometric
structures across runs. In all cases, the PCA projections reveal two approximately orthogonal directions
of high variance, suggesting the presence of structured, nonlinear relationships in the data. We attribute
this behavior to the auxiliary loss term, which penalizes linear solutions and encourages nonlinearly
solvable structures. To further support this interpretation, Figure 4 compares PCA projections obtained
with <inline-formula><tex-math>\lambda = 0</tex-math></inline-formula> and <inline-formula><tex-math>\lambda = 0.15</tex-math></inline-formula>. When <inline-formula><tex-math>\lambda = 0</tex-math></inline-formula>, the resulting dataset exhibits a more linear and homogeneous
structure, whereas a positive baseline penalty yields more pronounced, structured variance.</p>
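        <p>The two-dimensional projection used for this kind of qualitative comparison can be sketched with a singular value decomposition. The function pca_project is a hypothetical helper, and the input is an isotropic Gaussian matrix used only as a stand-in for make_regression-style features; for such data the explained variance is spread almost evenly over all directions, which is consistent with a compact, isotropic cloud in the projection.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def pca_project(X, n_components=2):
    """Project X onto its top principal components and return their explained-variance ratios."""
    Xc = X - X.mean(axis=0)                              # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # rows of Vt are principal directions
    Z = Xc @ Vt[:n_components].T
    ratio = (S ** 2) / np.sum(S ** 2)
    return Z, ratio[:n_components]


rng = np.random.default_rng(0)
X_gauss = rng.normal(size=(2000, 100))                   # stand-in for Gaussian input features
Z, ratio = pca_project(X_gauss)
print(Z.shape, ratio)
```
]]></preformat>
        <p>Applying the same function to a dataset produced by the proposed method allows the two projections to be plotted side by side.</p>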
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Real-World Use Case</title>
      <p>
        To assess the practical usefulness of the proposed dataset generation methodology, we used it
to generate a dataset for a real-world evaluative setting. The dataset was used
as part of an exam for a Master’s course in data science, in addition to a written test and participation
in a competition-like project [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The task for this part of the exam consisted of addressing a regression
problem by designing and training a predictive model under time and resource constraints (90 minutes,
with access to a virtual machine and Python libraries that are commonly adopted in data science).
      </p>
      <p>The generated dataset contained 10 continuous, 5 ordinal, and 6 categorical attributes, and was split
into 5,700 development samples (training and validation) and 1,400 test samples. Participants were
provided with the development split (with target values) and evaluated on the held-out test set using
the <inline-formula><tex-math>R^2</tex-math></inline-formula> metric.</p>
      <p>A total of 143 participants took part in the exam. Of these, 52 managed to submit a working solution
in the allotted time (students are allowed to submit working solutions after the exam, with a penalty
proportional to the number of changes w.r.t. the solution submitted during the exam). Figure 5 reports
the distribution of <inline-formula><tex-math>R^2</tex-math></inline-formula> scores obtained on the test set. The left panel shows the overall performance
distribution, which is highly concentrated, indicating that the task is learnable using appropriate
modeling choices. At the same time, performance does not trivially saturate, as reflected by the spread
of scores and the presence of a small number of low-performing solutions.</p>
      <p>The right panel focuses on the high-performance region of the distribution and reveals that even
among top-performing solutions, the achieved <inline-formula><tex-math>R^2</tex-math></inline-formula> values exhibit meaningful variability. This result
is particularly relevant in an evaluative context, as it demonstrates that the generated dataset can
discriminate between solutions, rather than collapsing to the same performance.</p>
      <p>Overall, this deployment confirms that the proposed generation method produces datasets that are
both accessible and non-trivial in practice. The controlled nonlinearity of the task, combined with
mixed feature types and latent information, yields a regression problem that admits multiple viable
solutions while still rewarding careful model selection and design.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The proposed method is not intended to produce realistic or privacy-preserving data. Instead, it provides
a controlled mechanism for constructing datasets with known and adjustable properties. This makes it
particularly suitable for educational settings, where clarity, control of difficulty, and reproducibility are
more important than fidelity to real-world distributions.</p>
      <p>Non-triviality is enforced structurally through explicit optimization objectives and information
removal, rather than through arbitrary noise injection. As a result, the generated datasets remain
learnable while avoiding degenerate or saturated solutions, as supported by both experimental results
and real-world deployment.</p>
      <p>We note, however, that the method has limitations. Dataset properties depend on the choice of
the frozen nonlinear model, and the optimization procedure provides no theoretical guarantees of
global optimality. These limitations are acceptable in the intended use cases, where controllability and
practical effectiveness take precedence over formal guarantees.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We introduced a differentiable approach to synthetic dataset generation that optimizes input features
to satisfy pedagogical constraints. By combining a frozen nonlinear model with an explicit
baseline-penalizing objective, we generate regression datasets that are learnable yet non-trivial. The method is
simple to implement, highly controllable, and well-suited for use in exams and teaching. We showed the
advantages of the proposed approach over existing baselines and successfully used the process to
generate datasets relevant to data science exams. Future extensions will mainly aim to address problems
other than regression (e.g., classification, anomaly detection, and clustering). We note that each of
these families of problems has different characteristics and constraints, making the extension of
the current work non-obvious. We hope this work encourages further exploration of purpose-built
synthetic data for educational and evaluation purposes.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding
from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA
(PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555 11/10/2022, PE00000013).
This manuscript reflects only the authors’ views and opinions; neither the European Union nor the
European Commission can be considered responsible for them.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (GPT 5.2) in order to: Drafting content.
After using this tool, the author reviewed and edited the content as needed and takes full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative adversarial nets</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>27</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skoularidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuesta-Infante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          ,
          <article-title>Modeling tabular data using conditional gan</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          ,
          <article-title>Synthesizing tabular data using generative adversarial networks</article-title>
          ,
          <source>arXiv preprint arXiv:1811.11264</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Haug</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pawelczyk</surname>
          </string-name>
          , G. Kasneci,
          <article-title>Deep neural networks and tabular data: A survey</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>7499</fpage>
          -
          <lpage>7519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Orzechowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <article-title>Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers</article-title>
          ,
          <source>arXiv preprint arXiv:2107.06475</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <article-title>A synthetic benchmark to explore limitations of localized drift detections</article-title>
          , in: International Workshop on Discovering Drift Phenomena in Evolving Landscapes, Springer,
          <year>2024</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Malin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. F.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Generating multi-label discrete patient records using generative adversarial networks</article-title>
          ,
          <source>in: Machine Learning for Healthcare Conference</source>
          , PMLR,
          <year>2017</year>
          , pp.
          <fpage>286</fpage>
          -
          <lpage>305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Savelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <article-title>FAME: Fictional actors for multilingual erasure</article-title>
          ,
          <source>in: Proceedings of the Fifteenth Language Resources and Evaluation Conference, European Language Resources Association</source>
          ,
          <year>2026</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Frid-Adar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Diamant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Amitai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Greenspan</surname>
          </string-name>
          ,
          <article-title>GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>321</volume>
          (
          <year>2018</year>
          )
          <fpage>321</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Borra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Savelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <article-title>MALTO at SemEval-2024 Task 6: Leveraging synthetic data for LLM hallucination detection</article-title>
          ,
          <source>in: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1678</fpage>
          -
          <lpage>1684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pasini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ventura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Apiletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cerquitelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chiusano</surname>
          </string-name>
          ,
          <article-title>DSLE: A smart platform for designing data science competitions</article-title>
          ,
          <source>in: 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>