<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel Assurance Procedure for Fair Data Augmentation in Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samira Maghool</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Ceravolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Berto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milan, Department of Computer Science</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In addressing the limited availability of data for predictive purposes with machine learning, we are concerned with potential biases arising from dataset augmentation. Despite advanced algorithms to generate synthetic data that can preserve the original data distribution, challenges remain, including the risk of perpetuating social biases. Our approach uses a similarity network representation that treats each data point as a node and strategically generates synthetic points near it. A vector label propagation algorithm, complemented by an exponential kernel for adjusting link weights, accurately labels these synthetic points. The primary goal is to reduce the system's dependence on sensitive features without excluding them, thereby avoiding the risk of exacerbating biases or reducing data variation. Implemented in a big data ecosystem, our methodology enables continuous evaluation in an evolving domain, effectively addressing the challenges of data scarcity with a fairness-aware approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Fairness</kwd>
        <kwd>Similarity Network</kwd>
        <kwd>Data Augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The widespread adoption of Machine Learning (ML) technologies across industries has ushered
in a new era of data-driven decision-making [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While ML promises to increase efficiency
and productivity, its application in decision-making processes presents a number of challenges,
ranging from performance to regulatory compliance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Regulatory frameworks, such as the
European Union’s proposed Artificial Intelligence Act, emphasize the importance of fairness and
accountability [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Overcoming these challenges requires industries to establish comprehensive
testing frameworks that evaluate ML models’ performance, reliability, and generalization across
multiple scenarios [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. But developing frameworks for regulatory compliance is a complex task.
As regulations evolve and new data becomes available, industries must establish mechanisms
for continuously monitoring and updating ML models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The different techniques used by
designers to achieve specific properties in ML systems can conflict with each other. Fairness
may come at the expense of accuracy, accuracy at the expense of transparency, and privacy
compliance at the expense of explainability [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>
        In this paper, our contribution is to explore the delicate balance between data augmentation
and fairness in tabular data, with the goal of developing a solution suitable for integration into
a continuous assurance framework. ML model training is inherently data-hungry, requiring
a significant amount of data for accurate results, often necessitating the expansion of the
dataset through augmentation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Traditional data augmentation techniques applied to tabular
datasets focus on creating duplicates and ensuring their consistency with the data distribution.
This is achieved by assigning values through random perturbations or by adhering to central
tendency [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, many established augmentation techniques do not explicitly incorporate
fairness principles. If the training dataset contains social biases, these biases can persist in the
augmented data. For instance, applying central tendency functions to features already biased
due to social or demographic factors may inadvertently mirror and reinforce these biases in the
augmented data, rather than mitigating them.
      </p>
      <p>
        Addressing fairness in the training of AI models requires careful consideration of the complex
issue of discrimination. Discrimination in the context of AI is not a universally applicable concept
but is intricately tied to socially salient categories that have historically been subject to unfair and
systematic disadvantage [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These categories are closely linked to specific subgroups within
the population. We use subgroup to denote a group of individuals defined by a shared value of a
demographic variable such as race, gender, sexual orientation, religion, disability status, etc [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
In the legal frameworks of various states, certain categories are officially recognized as protected.
The recognition of these socially salient categories underscores the need to consider sensitive
demographic variables when training AI models. Removing or ignoring these variables doesn’t
automatically ensure fairness. In fact, it can lead to unintended consequences, as the model
may still learn and replicate biases present in the training data even when explicit features are
excluded [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Moreover, the impact of bias is not limited to the training process but extends
to the broader deployment and use of AI systems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Biases in training data can manifest
themselves in biased predictions and decisions, perpetuating systemic inequalities. To address
these challenges, a holistic approach to fairness is critical. This includes using fairness-aware
algorithms, conducting thorough bias audits of training data, and continuously assessing the
impact on different subgroups.
      </p>
      <p>Therefore, the methodology we have adopted to create data augmentation techniques that
respect the principles of fairness includes the following facets.</p>
      <p>• A data augmentation method that can mimic the distribution of the data while redressing the
balance between subgroups in a dataset. This increases the representativeness
of underrepresented subgroups.
• A process guided by the entire feature space, excluding no variable of the
dataset. This comprehensive exploration ensures a holistic consideration of the data
without penalizing or favoring any subgroups.
• Incremental procedures that align with continuous assessment frameworks, updating
as new data is collected. This incremental approach facilitates ongoing validation and
adjustment of the augmentation process in response to evolving data sets that may contain
different distributions.</p>
      <p>
        One of the key points of our methodology is to represent the data set using a similarity
network. Within this network, we consider each data point as a node and strategically generate
synthetic data points in close proximity to the original data points. These synthetic points
form connections and inherit features from their neighboring nodes. In previous studies [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
we observed that this network generates a latent feature space capable of capturing grouping
patterns and latent associations within the dataset. The latent feature space is crucial for
preventing classifiers from being biased by a restricted set of features, reducing the risk of
segmenting the data based on sensitive features. In this work, we extend the capabilities of our
methodology by using the density of the network to identify less represented groups within the
dataset. By generating synthetic data near these groups, we naturally balance the representation of
different groups in the dataset. To ensure the accurate labeling of these synthetic points, i.e., their
association with the target labels to be predicted by a trained ML model, we use a vector label
propagation algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], complemented by an exponential kernel for fine-tuning the link
weights. Importantly, we avoid excluding any features, as such exclusions can worsen biases
and compromise data diversity. By incorporating all features into the network, our approach
fosters a comprehensive understanding of the dataset’s complexity, resulting in a more robust
and unbiased representation.
      </p>
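      <p>To make the idea concrete, the following minimal sketch (ours, not the paper's implementation) builds a similarity network over a small synthetic numeric dataset and generates new points beside the sparsest nodes. The Gaussian similarity, the 0.5 threshold, and the noise scale are illustrative assumptions:</p>

```python
import numpy as np

def similarity_network(X, threshold=0.5):
    """Build a similarity network: nodes are data points, edges connect
    sufficiently similar pairs (Gaussian similarity on Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = np.exp(-d**2 / (2 * d.mean()**2))
    adj = (sim >= threshold) & ~np.eye(len(X), dtype=bool)
    return sim, adj

def augment_low_density(X, adj, rng, n_new=1, noise=0.05):
    """Generate synthetic points near the least-connected (lowest-density)
    nodes, rebalancing under-represented regions of the dataset."""
    degree = adj.sum(axis=1)
    targets = np.argsort(degree)[:n_new]       # sparsest nodes first
    jitter = rng.normal(scale=noise, size=(n_new, X.shape[1]))
    return X[targets] + jitter                 # synthetic points close by

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),   # dense majority cluster near 0
               rng.normal(3, 0.1, (3, 2))])   # sparse minority cluster near 3
sim, adj = similarity_network(X)
synthetic = augment_low_density(X, adj, rng, n_new=2)
```

      <p>In this toy run the two synthetic points land beside the under-represented cluster, which is the balancing behavior described above.</p>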
      <p>
        To ensure this continuous verification, we deployed the training and verification pipelines of
the ML model using a platform provided by the MUSA project [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This platform provides
an edge-cloud continuum service infrastructure that delivers high-performance computing
resources while assuring advanced non-functional properties such as privacy and application
security. The solution proposed by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] utilizes contract-based continuous verification using
evidence gathered through transparent monitoring. The data pipelines described in this paper
are executed on this platform, monitoring both their behavior and artifacts, and verifying that
the model’s performance and fairness properties are maintained. In this way, we have evaluated
whether the dataset augmentation method we propose is able to enlarge the data points while
guaranteeing fairness in the ML models trained after the augmentation.
      </p>
      <p>By addressing the challenges associated with data scarcity, our method makes a significant
contribution to the ongoing quest to develop unbiased ML systems that can effectively generalize
across diverse and dynamic datasets. Through this research, we aim to foster the creation of
ML models that not only exhibit fairness, but also maintain high levels of data variation and
representation, ensuring their applicability to a wide range of real-world scenarios. Specifically,
the paper is organized as follows. In Section 2, we discuss related work. In Section 3 we introduce
the basic notions necessary to understand this work. Section 4 presents the proposed approach
and explains how we incorporated it into an assurance procedure. Section 5 discusses in detail
the experimental setup, the steps to be taken from data preparation to model description. In
Section 6, we demonstrate the results we obtained by implementing our proposed method on
a public dataset and evaluation metrics comparing the original dataset with the augmented
version. Finally, after Discussion (Section 7), we conclude this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Three primary approaches to generating synthetic data for augmentation purposes have been
identified in the literature [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]:
      </p>
      <p>
        Generating synthetic data according to a given distribution. This approach involves
generating synthetic data points that match the statistical properties and patterns expected in
the target distribution [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. It uses knowledge of distribution properties, such as normal,
exponential, and chi-square, to generate synthetic samples without relying on actual data points.
Many techniques are built by combining and balancing bootstrapping and perturbation steps.
Bootstrapping involves generating synthetic data by duplicating existing data. Conversely,
perturbation introduces controlled noise or randomization into real data. This step is also
exploited to create synthetic data while maintaining privacy. By changing sensitive variables
or details in the data, synthetic data can be generated that retains the statistical properties
of the original dataset while making re-identification very difficult. A recognized limitation
of duplication-based approaches is their tendency to lead to overfitting [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. To mitigate this
problem, state-of-the-art methods employ density measures to regulate the duplication ratio in
different regions of the dataset [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ]. Our approach aligns with this strategy. However, to the
best of our knowledge, none of the existing methods evaluate the duplication process using
fairness-aware metrics, as we do.
      </p>
      <p>
        Agent-based modeling (ABM). Addressing the challenge of simulating systems with many
interacting parts that evolve over time, ABM is a robust method for generating synthetic data
that effectively augments historical data [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Conceptually aligned with complex systems,
ABM allows events to emerge from interactions among fully autonomous agents that follow
rules, behaviors, and decision processes. This approach is particularly useful for understanding
the dynamics of complex systems. Its strength lies in providing a high degree of realism and
granularity, capturing emergent phenomena that result from the interactions of individual
agents. Indeed, it has historically been used in the natural sciences [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ]. It is also very
effective as an incremental learning technique, where iteration after iteration refines the global
knowledge of the system based on information acquired at the local level [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. However,
ABM also has some limitations. Developing accurate ABMs can be complex, requiring careful
calibration of parameters, and the computational intensity of simulations can be challenging in
terms of computing power and time [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Validation and verification of ABM results against
real-world data can be difficult due to the dynamic and emergent nature of the results [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. In
addition, ABM often requires detailed data on individual agents, and its sensitivity to parameter
changes highlights the importance of robust sensitivity analysis [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Our method uses an ABM
algorithm but fully exploits the feature space of the target dataset without relying on external
or global data, and it is calibrated on fairness metrics.
      </p>
      <p>
        Generative Adversarial Networks (GANs). Generative Adversarial Networks (GANs) play
an important role in the synthetic data generation landscape, enhancing the ability to generate
data that not only has statistical similarities, but also has visual and contextual similarities
to real-world data [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. At the core of GANs is the paradigm introduced by Goodfellow et
al.[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], which encapsulates the concept of adversarial training. It involves a dynamic interplay
between two key components: the generator and the discriminator. The generator is tasked
with creating synthetic data, attempting to emulate real-world data distributions. At the same
time, the discriminator acts as a classifier, distinguishing between real and synthetic data.
Through iterative training, the generator refines its ability to generate increasingly realistic data,
while the discriminator refines its ability to distinguish between real and synthetic instances.
The adversarial nature of this training process creates a feedback loop that forces both the
generator and the discriminator to continually improve. Despite the effectiveness of GANs
in generating realistic data, they are susceptible to mode collapse, a limitation in which the
generator produces a limited variety of samples. This can prevent the generation of extreme or
diverse data points, limiting the overall effectiveness of the synthetic data generation process.
To address the challenge of mode collapse and increase the diversity of the generated data,
various strategies have been explored. One common approach is to modify the dataset or learn
a fair distribution to mitigate bias [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. In the context of mitigating unfairness using adversarial
learning, it is important to note that the selection and treatment of sensitive features play a
critical role. Typically, studies in this area require the predefinition of sensitive features, and
the algorithm systematically addresses the mitigation of one sensitive feature at a time. This is
a significant limitation that does not apply to our approach, since multiple sensitive features
are considered when generating new examples.
      </p>
      <p>
        While the potential of augmentation algorithms using synthetic data generation is promising,
it is important to acknowledge and address criticisms that have emerged in the literature.
Specifically, concerns have been raised about the potential for these algorithms to inadvertently
promote polarization and biased information [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Such unintended consequences have the
potential to undermine trust in ML systems [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. This critical finding underscores the need to
develop augmentation strategies that go beyond the mere consideration of statistical properties.
This involves designing augmentation algorithms that are not only effective in enhancing the
diversity of the dataset but also sensitive to and mitigate potential biases.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background notions</title>
      <p>The basic concepts necessary to understand the methodology presented in this paper are
explained in this section.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Generation using Agent-Based Modeling</title>
        <p>In Agent-based modeling (ABM), each agent has distinct characteristics and behaviors, allowing
the exploration of emergent phenomena from their collective interactions. ABM emphasizes
capturing heterogeneity among agents and local interactions that drive overall system dynamics.
ABMs study the interactions among many independent decision-making agents in a discrete
spatiotemporal environment. The model operates through discrete time steps, during which
agents update their states based on predefined rules and responses to their local environment.
Agents in ABM can exhibit a wide range of behaviors, from random actions to adaptive decisions
based on prevailing conditions. This flexibility allows the modeling of diverse scenarios.</p>
        <p>While agent-based modeling (ABM) often involves specific rules tailored to the characteristics
of the modeled system, a general framework can be outlined to illustrate the workings of ABM.
Let S(t) be the state of the system at time t, A(t) represent the actions taken by individual agents
at time t, and E(t) denote the environment or context at time t. The dynamics of the system can
be expressed through the following formula:
S(t + 1) = F(S(t), A(t), E(t))
(1)</p>
        <p>In this formula, S(t + 1) is the state of the system at the next time step (t + 1), and F is a function
that describes how the state of the system at time t + 1 is determined by its state at time t, the
actions of the agents A(t), and the environment E(t).</p>
        <p>This formula shows that the future state of the system is affected by the current state, the actions
of the agents, and the environment. The specific form of the function F will vary depending on
the characteristics and rules governing the agents and their interactions in the modeled system.</p>
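        <p>A minimal, hypothetical sketch of this generic update loop follows; the concrete action rule (each agent moves one step toward a target value held in the environment) is purely illustrative and not tied to any specific system:</p>

```python
# Minimal sketch of the generic ABM update S(t+1) = F(S(t), A(t), E(t)).
# The action rule below (step toward the environment's target) is an
# illustrative assumption, not a model from the paper.

def actions(state, env):
    """Each agent decides an action from its local view of the environment."""
    return [1 if s < env["target"] else -1 if s > env["target"] else 0
            for s in state]

def F(state, acts, env):
    """Transition function: next state from current state, actions, environment."""
    return [s + a for s, a in zip(state, acts)]

def simulate(state, env, steps):
    for _ in range(steps):           # discrete time steps
        acts = actions(state, env)   # A(t)
        state = F(state, acts, env)  # S(t+1) = F(S(t), A(t), E(t))
    return state

final = simulate([0, 2, 7], {"target": 4}, steps=5)  # all agents reach 4
```

        <p>Even this toy loop shows the defining structure: agents act on local rules, and the global state emerges from iterating the transition function.</p>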
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Agent-Based Vector-Label Propagation algorithm (AVPRA)</title>
        <p>
          Graph-based semi-supervised learning hinges on the idea that adjacent nodes are likely to
share similar labels, an assumption supported by the well-known homophily
concept in social network analysis. Label Propagation (LP) algorithms embody this concept by
assigning labels to unlabeled nodes based on the similarities among their neighboring nodes.
Drawing inspiration from epidemic spread research, these algorithms compile information
spread through node contacts to define an individual’s features, rather than just assigning a
single label. This approach not only uncovers the structural details of the graph but also ensures
a balanced representation of the existing features. Expanding on LP algorithms, Label-Vector
Propagation algorithms assign a vector of labels to each node instead of one. Our earlier study
[
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] introduced the application of an agent-based algorithm in the Vector-label Propagation
technique. This algorithm employs Vector Labels (VLs) to disseminate weighted labels across
the network via edges. A notable aspect of AVPRA is its unique VL size, enhancing its utility in
subsequent ML applications. Additionally, it features normalized coefficients within the vector,
summing to 1, thereby eliminating biased learning due to uneven feature distribution in the
graph. The rationale behind leveraging an Agent-Based Model (ABM) is its capacity to admit
a large number of parameters in modeling the propagation phenomena. ABM adopts simple,
understandable rules for propagating features through links, tunable to the desired phenomena
and output [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. In the current work, we exploit the aforementioned AVPRA technique
for the correct assignment of labels to the synthetic data points.
        </p>
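        <p>The following simplified sketch conveys the flavor of vector-label propagation; it is a basic clamped propagation scheme, not the full AVPRA algorithm. Each node carries a label vector whose coefficients sum to 1, and unlabeled nodes repeatedly absorb the weighted average of their neighbors' vectors while seed nodes stay fixed:</p>

```python
import numpy as np

def propagate(W, labels, n_iter=50):
    """Simplified vector-label propagation (not full AVPRA).
    W: symmetric edge-weight matrix; labels: dict node -> class index."""
    n, k = len(W), max(labels.values()) + 1
    VL = np.full((n, k), 1.0 / k)              # uniform start for unlabeled
    for node, cls in labels.items():
        VL[node] = np.eye(k)[cls]              # seed nodes: one-hot vectors
    for _ in range(n_iter):
        VL = W @ VL                            # absorb neighbours' vectors
        VL /= VL.sum(axis=1, keepdims=True)    # renormalise to sum to 1
        for node, cls in labels.items():
            VL[node] = np.eye(k)[cls]          # clamp seeds each iteration
    return VL

# Path graph 0-1-2-3: node 0 labeled class 0, node 3 labeled class 1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
VL = propagate(W, {0: 0, 3: 1})
```

        <p>On this path graph, the node adjacent to the class-0 seed converges to the vector (2/3, 1/3), illustrating how the normalized coefficients reflect proximity to each label.</p>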
      </sec>
      <sec id="sec-3-3">
        <title>3.3. System Fairness</title>
        <p>
          Work on training ML systems that produce fair decisions has defined several useful
measurements for fairness: Demographic Parity, Equality of Opportunity, and Equality of Mis-Opportunity.
These can be imposed as constraints or incorporated into a loss function to mitigate
disproportionate outcomes in the system’s output predictions with respect to a protected demographic,
such as gender. Prior work in the same scope can be classified into three groups depending
on the approach applied to remove bias: Pre-processing algorithms [
          <xref ref-type="bibr" rid="ref36 ref37">36, 37</xref>
          ], In-processing
algorithms [
          <xref ref-type="bibr" rid="ref32 ref38">38, 32</xref>
          ] and, Post-processing algorithms [
          <xref ref-type="bibr" rid="ref39 ref40 ref41">39, 40, 41</xref>
          ].
        </p>
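        <p>As a hedged illustration of the first two measurements, the sketch below computes demographic parity and equality-of-opportunity gaps for a binary classifier and a binary protected attribute; the toy arrays are invented for the example:</p>

```python
import numpy as np

# y_true: ground-truth labels, y_pred: model decisions, group: protected
# attribute (e.g. 0/1 for two gender groups). Data is illustrative.

def demographic_parity_gap(y_pred, group):
    """Difference in positive-decision rates between the two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    """Difference in true-positive rates between the two groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

dp = demographic_parity_gap(y_pred, group)         # |1/4 - 3/4| = 0.5
eo = equal_opportunity_gap(y_true, y_pred, group)  # |1/2 - 1| = 0.5
```

        <p>Such gaps can be imposed as constraints or folded into a loss term, as noted above, depending on whether the mitigation is pre-, in-, or post-processing.</p>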
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Dataset Complexity</title>
        <p>
          Given the fact that performance metrics do not provide a full insight into the level of complexity
of a classification problem that an algorithm has to deal with, we also need to investigate whether
the data set itself allows a clear separation between classes. In other words, to investigate how
complex a classification problem is in itself and whether some features play a more effective role
than others, we need to consider other criteria. In this context, we emphasize that if a class is
clearly separated by a subset of sensitive features, this could be a sign of biased or
discriminatory behavior imposed on the system by the data set. According to [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], these measures are
categorized as “feature-based”, “linearity”, “neighborhood”, “network”, “dimensionality”, and
“class imbalance” measures, each comprising several metrics. In this paper, we
evaluate some of them by comparing the complexity of the original dataset with the augmented
version.
        </p>
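        <p>As one concrete instance of a feature-based measure, the sketch below computes Fisher's discriminant ratio per feature for a binary problem; a feature that alone separates the classes, as a sensitive feature might, stands out with a high ratio. The synthetic data is illustrative:</p>

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher discriminant ratio: (mu0 - mu1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return num / den

rng = np.random.default_rng(1)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([
    rng.normal(0, 1, 100),                                  # uninformative
    np.where(y == 0, 0.0, 3.0) + rng.normal(0, 0.3, 100),   # separating
])
ratios = fisher_ratio(X, y)   # second feature dominates by far
```

        <p>A single dominant ratio on a sensitive feature would be exactly the warning sign of potentially discriminatory separability discussed above.</p>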
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The proposed Approach</title>
      <p>In this section, we explain our methodology and discuss how it aligns with an infrastructure
designed for continuous assurance verification.</p>
      <sec id="sec-4-1">
        <title>4.1. Assuring Fair Data Augmentation</title>
        <p>The starting point of our methodology is to redefine the data set through the lens of a similarity
network, following the methods described in Section 5.2. This provides us with a view of the
structural organization of the dataset that we can exploit in generating synthetic data and
interpreting our results. The synthetically generated data points form connections and inherit
features from their neighboring nodes. By favoring the generation of synthetic data in close
proximity to groups with lower density, we inherently balance the representation of different
subgroups in the dataset. The AVPRA algorithm ensures accurate labeling of these synthetic
points, as explained in Section 5.3. In addition, the edges of the similarity network provide
us with a latent feature space that protects the classifiers from bias due to a limited set of
features. This, in turn, mitigates the risk of data segmentation based on sensitive features,
as demonstrated in Section 6. Section 5.4 provides a detailed explanation of the procedure
we followed. Another key point of our proposal is to exploit the incremental nature of our
ABM algorithm to insert our method into a continuous assurance infrastructure. The metrics
evaluated by our algorithm can be inserted into a library of assurance tests periodically executed
by the system to monitor the reliability of ML models, as explained in Section 4.3.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Infrastructure</title>
        <p>
          Our method has been included in the MUSA platform. This platform provides a 5G-enabled
edge-cloud continuum infrastructure, supporting processes with continuous assurance techniques for
advanced non-functional requirements. The platform provides multiple cloud services,
including storage, computing, and data pipeline management, guaranteeing strong non-functional
properties, such as data privacy and locality, security, and isolation, while providing high
availability and performance [
          <xref ref-type="bibr" rid="ref16 ref43">16, 43</xref>
          ].
        </p>
        <p>In this paper, we integrate our synthetic data generation methodology with some of the
services provided, in particular the storage, computing, pipeline and assurance platforms,
as described in Figure 1. We used services from the Apache stack because of their ease of
interoperability and open-source nature.</p>
        <p>Storage. The platform’s storage solution is based on Apache Hadoop and can be accessed
through a RESTful API using the WebHDFS protocol. The system also features access control and
logging, transparent replication and snapshotting, data lineage metadata, and high availability.</p>
        <p>[Figure 1: Overview of the platform services: data sources, storage, computing, pipeline, assurance, model, and report.]</p>
        <p>Computing. The computing platform is based on Kubernetes and offers containerized
environments for Python-based data analysis. Additionally, the platform provides highly
parallelized execution solutions through Apache Spark and Trino clusters, in combination with
GPU-accelerated workflows.</p>
        <p>Pipeline. This research utilized the Apache Airflow pipeline platform. Apache Airflow is an
extensible DAG-based execution scheduler that enables the coordination of analysis pipelines
and integrates with numerous services and components through Python bindings and APIs.
Assurance. The platform’s assurance component monitors the services in use and continuously
verifies a set of required properties. To identify which checks need to be executed, the assurance
service uses annotations on the pipeline DAG and a set of templates for common checks. The
solution is easily extendable, allowing us to implement custom verification tasks within the
pipeline itself.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Assurance and Certification</title>
        <p>The verification and certification of the models’ properties is implemented by a continuous
assurance process that manages the data analysis and model training tasks in the edge-cloud
continuum. This process checks evidence collected from the tasks against a set of formal
contracts, inferring whether the associated properties are holding. We first define a set of
desired non-functional properties that we want to verify. These are abstract concepts that
indicate an inherent behavior or quality of the ML model, in this case, data anonymity, accuracy,
and fairness. In order to ensure the validity of a property, it is necessary to gather evidence on the
task. This can be achieved by collecting metrics that provide an objective view of the system’s
state and behavior. Some of these metrics are already available through the execution platform,
such as data access logging, execution tracing, resource usage, and source code security and
quality analysis. Others, such as the model’s scoring metrics (accuracy, fairness), are task/model
specific and need to be implemented by the data pipeline. The platform continuously collects
measurements of the pipeline’s aspects and saves them in a time-series optimized storage
solution. Monitoring can be continuous (at a certain frequency or event-based) or on-demand,
only running when required by the assurance process. The next step is to define the contracts
for the process, each verifying a particular property. These contracts interact with the metrics
interface, querying for previously collected or on-demand evidence. An Accredited Lab, a trusted
entity with full access to the evidence, executes the contracts and provides a report on the
inferred information and property status. In this case, the model is an artifact of the data
pipeline and its properties are not expected to change after its release. Therefore, the model
and its certificate have tightly linked lifetimes, and certificate revocation is unnecessary.</p>
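        <p>A contract can be thought of as a predicate over collected evidence. The sketch below is purely illustrative; the metric names and thresholds are our assumptions and not the MUSA platform's API:</p>

```python
# Hypothetical contract-based verification: each contract binds a desired
# property to a predicate over collected evidence (metrics). Metric keys
# and thresholds are invented for illustration.

CONTRACTS = {
    "accuracy": lambda m: m["accuracy"] >= 0.80,
    "fairness": lambda m: m["demographic_parity_gap"] <= 0.10,
}

def verify(metrics):
    """Evaluate every contract against the evidence; report property status."""
    return {prop: check(metrics) for prop, check in CONTRACTS.items()}

evidence = {"accuracy": 0.87, "demographic_parity_gap": 0.14}
report = verify(evidence)   # accuracy holds, fairness fails
```

        <p>In a continuous setting, such predicates would be re-evaluated whenever fresh evidence is collected, matching the monitoring modes described above.</p>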
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Dataset</title>
        <p>In order to evaluate the proposed methodology, we used a public data set of the Curriculum
Vitae (CV) of 301 employees1, which contains both numerical and categorical features.
Payment Categorization: Given that the payment rate is a continuous value in the dataset, but
our objective is to address a classification problem, we categorize the Payment using automated
bin selection. This method determines the number of categories based on maximizing instances
within a bin. Consequently, the number of bins varies according to the data distribution,
potentially resulting in a multi-class classification problem.</p>
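        <p>The automated binning step can be sketched as follows. The exact bin-selection rule is not specified in the text, so numpy’s “auto” histogram estimator (a combination of the Sturges and Freedman–Diaconis rules) is used here as an illustrative stand-in; the function name and the synthetic pay values are assumptions for the example.</p>
        <p>
```python
import numpy as np

def categorize_pay_rate(pay_rates, strategy="auto"):
    """Map a continuous pay rate to discrete class labels.

    Sketch only: numpy's automated bin-width selection stands in for
    the paper's automated bin selection; the 'auto' rule is an
    assumption, not necessarily the authors' exact procedure.
    """
    edges = np.histogram_bin_edges(pay_rates, bins=strategy)
    # digitize is 1-based; clip so the maximum value lands in the last bin
    labels = np.clip(np.digitize(pay_rates, edges), 1, len(edges) - 1) - 1
    return labels, edges

rng = np.random.default_rng(0)
pay = rng.normal(loc=30.0, scale=8.0, size=301)  # synthetic stand-in for the 301 CVs
labels, edges = categorize_pay_rate(pay)
```
Depending on the data distribution, the number of resulting bins (classes) varies, which is exactly why the task may become multi-class.</p>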
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Similarity measurements</title>
        <p>
Graph construction from datasets based on instances’ similarities can be implemented using
various mechanisms to measure similarity values. In this paper, to create the weighted network
G = (V, E), where V corresponds to the network nodes (vertices) and E to the links (edges)
among them, we use the Gower distance, which is often used for dealing with heterogeneous data
[
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. For a set of p variables G = {g_1, g_2, ..., g_p}, the similarity S′ between two instances a and b is
then defined as the average of the per-variable Gower similarities (s_k = 1 − d_k):
S′(a, b) = (1/p) · Σ_{k=1}^{p} s_k(a, b, g_k), (2)
        </p>
        <p>
The creation of a similarity network requires choosing a threshold value of pairwise
similarities that defines the existence (or non-existence) of links between points. To make this
choice less critical, kernel matrices that are able to capture the topological characteristics of the
underlying graph provide a solution. Edge weights are represented by an n × n similarity
matrix W, where W(a, b) indicates the similarity between data points a and b. We use a scaled
exponential kernel (Ek) to determine the weight of the edges:
W(a, b) = exp(−d²(a, b) / (μ · ε_{a,b})), (3)
where, according to [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ], μ is a hyperparameter that can be empirically set and ε_{a,b} is
used to eliminate the scaling problem by defining:
        </p>
        <p>
ε_{a,b} = [ mean(d(a, N_a)) + mean(d(b, N_b)) + d(a, b) ] / 3, (4)
where mean(d(a, N_a)) is the average value of the distances between a and each of its k nearest
neighbours N_a. The range μ ∈ [0.3, 0.8] is recommended by [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ].
        </p>
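        <p>The similarity construction of Eqs. 2–4 can be sketched as follows. The per-feature distance definitions follow the standard Gower convention (matching categories give distance 0; numerical distances are range-normalized), and all function names are illustrative assumptions:</p>
        <p>
```python
import numpy as np

def gower_similarity(a, b, is_categorical, ranges):
    """Average per-variable Gower similarity s_k = 1 - d_k (cf. Eq. 2).

    Categorical variables contribute d_k = 0 on a match and 1 otherwise;
    numerical ones contribute d_k = |a_k - b_k| / range_k.
    """
    sims = []
    for k, (x, y) in enumerate(zip(a, b)):
        if is_categorical[k]:
            d = 0.0 if x == y else 1.0
        else:
            d = abs(x - y) / ranges[k] if ranges[k] > 0 else 0.0
        sims.append(1.0 - d)
    return float(np.mean(sims))

def scaled_exponential_kernel(D, k=3, mu=0.5):
    """Edge weights W(a, b) = exp(-d^2(a, b) / (mu * eps_ab)) (cf. Eq. 3),
    with eps_ab the mean of the two nodes' average k-NN distances and
    d(a, b), divided by 3 (cf. Eq. 4); mu in [0.3, 0.8]."""
    # mean distance of each node to its k nearest neighbours, self excluded
    mean_knn = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (mean_knn[:, None] + mean_knn[None, :] + D) / 3.0
    return np.exp(-D ** 2 / (mu * eps))

# toy distance matrix for four instances
X = np.array([[0.0, 1.0], [0.1, 0.9], [2.0, 2.0], [2.1, 1.9]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = scaled_exponential_kernel(D, k=2, mu=0.5)
s = gower_similarity([1.0, "mgr"], [3.0, "mgr"], [False, True], [4.0, 0.0])
```
The kernel keeps W symmetric with unit self-similarity, while ε_{a,b} adapts the bandwidth to the local neighbourhood scale of each pair of nodes.</p>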
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Implementation of AVPRA</title>
<p>Leveraging this algorithm, each node of the graph is mapped into a latent space by a
d-dimensional Vector-Label (VL) containing normalized coefficients of the target labels in the
dataset (Σ_l VL_v[l](t) = 1). In this paper, we have altered the aggregation function, as in Eq. 5, to
capture the weight of links, W, we calculated before using Ek.</p>
        <p>VL_v[l](t) = w_1 · VL_v[l](t − 1) + w_2 · Σ_{u ∈ Γ(v)} W_{vu} · VL_u[l](t − 1) (5)</p>
        <p>According to this algorithm, at each time step t, the belonging coefficient of an element l
of VL_v[l](t) is updated by aggregating the neighbors’ vectors VL_u[l](t − 1), u ∈ Γ(v), where w_1
and w_2 are the weight of the currently assigned labels of node v and the weight of the neighbors
Γ(v), respectively. In a basic scenario, w_1 = w_2 = 1/(|Γ(v)| + 1); hence, for all the common
elements in the VL_v and VL_u vectors, the values of the given elements increase unconditionally
and are normalized to 1 by the inverse of the cardinality of Γ(v).</p>
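        <p>A minimal synchronous sketch of the update in Eq. 5, assuming the basic scenario w_1 = w_2 = 1/(|Γ(v)| + 1); how the labelled original nodes are seeded or clamped in the actual AVPRA implementation may differ from this illustration:</p>
        <p>
```python
import numpy as np

def avpra_step(VL, W):
    """One synchronous application of the aggregation in Eq. 5.

    VL: (n, d) vector-labels, one row per node; W: (n, n) edge weights
    (zero where no edge). Uses the basic scenario
    w1 = w2 = 1 / (|Gamma(v)| + 1) and re-normalises rows to sum to 1.
    """
    deg = (W > 0).sum(axis=1)                 # |Gamma(v)|
    w = 1.0 / (deg + 1.0)
    agg = w[:, None] * VL + w[:, None] * (W @ VL)
    sums = agg.sum(axis=1, keepdims=True)
    sums[sums == 0.0] = 1.0                   # isolated, still-unlabelled nodes
    return agg / sums

# two labelled original nodes (one-hot rows) and one unlabelled synthetic node
VL = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
W = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.1],
              [0.9, 0.1, 0.0]])
for _ in range(10):
    VL = avpra_step(VL, W)
label = int(VL[2].argmax())   # most-weighted label adopted by the synthetic node
```
Because the synthetic node is far more strongly linked to the first labelled node (weight 0.9 vs. 0.1), its vector-label converges toward that node’s class.</p>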
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Model Description</title>
<p>Considering a given dataset to be augmented, we initiate our approach by creating a
similarity graph out of the data points. The quantified similarity values are calculated from
the feature space, as previously described in Sec. 5.2, to construct the edges among nodes in
the similarity graph. Synthetic data generation takes place repeatedly until the assurance
condition is satisfied. Regarding the calibration of the fairness metrics, Equal Opportunity and
Equal Mis-Opportunity, as assurance criteria, we need to continuously evaluate them after adding
synthetic data points. At each round, we add a batch of synthetic nodes whose features are adopted
randomly from the same feature distribution as the original dataset, with no labels assigned.
By measuring the similarity of each synthetic data point with all the original
data points, we create links only if the similarity value is greater than the average weight
of the links between original nodes and their original neighbors (see Fig. 2). Having the updated network with
newly generated synthetic nodes and links, we initiate the AVPRA algorithm, diffusing the
labels from the original data points to the rest of the similarity network. For each node
v ∈ V, the labels are vectorized to the dimension given by the number of target labels.
For instance, if the labels contain three different categories, then for node v, after
uniquely vectorizing the labels, VL_v = [c_1, c_2, c_3], where c_i = 1 if the label of v is the i-th
category and the rest are equal to zero. As the result of the AVPRA algorithm using the
aggregation function in Eq. 5, after multiple iterations, the VL of the synthetic nodes gets
updated. The most weighted label of the VL is adopted as the label of the synthetic node. At
the end of each round of augmentation, the fairness metrics evaluation takes place in order to
ensure the augmentation process is not replicating the bias in the synthetic data set.</p>
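        <p>The per-round generation and linking step described above can be sketched as follows. The inverse-distance similarity used here is a toy stand-in for the Gower/kernel similarity of Sec. 5.2, and all names are illustrative assumptions:</p>
        <p>
```python
import numpy as np

def generate_synthetic_nodes(X, W, m, rng):
    """One augmentation round: draw m synthetic nodes and link them.

    Features are sampled column-wise from the empirical marginal
    distribution of the original data X; a link to an original node is
    created only when the similarity exceeds the average edge weight
    among the original nodes (the thresholding rule described above).
    """
    n, p = X.shape
    X_syn = np.column_stack([rng.choice(X[:, j], size=m) for j in range(p)])
    threshold = W[W > 0].mean()               # average original edge weight
    # toy similarity: exponential of negative Euclidean distance
    d = np.linalg.norm(X_syn[:, None, :] - X[None, :, :], axis=-1)
    S = np.exp(-d)
    links = S * (S > threshold)               # keep only above-threshold links
    return X_syn, links

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                  # toy "original" feature matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = np.exp(-D)
np.fill_diagonal(W, 0.0)
X_syn, links = generate_synthetic_nodes(X, W, m=5, rng=rng)
```
Sampling each feature from its own marginal keeps the synthetic points inside the observed feature ranges, while the thresholding rule prevents weakly similar synthetic nodes from distorting the network topology.</p>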
        <p>To assess the equilibrium between accuracy, transparency, and fairness, we employ a range
of evaluation metrics in our analysis.</p>
<p>Considering a predictive task based on the dataset, we propose implementing a classification
model to predict the payment levels of employees. To facilitate the analysis, we explore various
scenarios differing by dataset representation: the Original dataset, i.e., the dataset after
preprocessing and categorization of payments into Pay rate levels; the Balanced dataset, a
preprocessed dataset where we have addressed the class imbalance in the Pay rate levels; the
Dataset with balanced groups, a preprocessed dataset that is balanced considering protected
features for discriminated groups (in our experiment, we balanced by the Gender and
Hispanic/Latino features); and the Augmented dataset, where, leveraging the proposed
algorithm, we have augmented the dataset by 100 percent.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental results</title>
<p>After setting up the workflow and infrastructure as extensively described in Section 5, the
evaluation metrics on the given dataset are presented as follows:
Accuracy: in order to explore the fidelity of our proposed augmentation method in keeping the
dataset informative for prediction purposes, we trained two classifiers predicting the Pay rate
of employees in the given dataset. Table 1 reports the implemented models on the different
dataset representations. The original dataset obtains an F1 score of about 67% for both
classifiers. In the second trial, where the dataset classes are balanced using the SMOTE
algorithm, we witness an increase in the performance of the prediction compared to the original
data. Regarding the reduction of bias introduced by the sensitive features into the dataset and
affecting the output of the predictors, we have tried to balance the dataset (by adding
instances) according to the sensitive features, such as gender and ethnicity, in our dataset.
As a result, in comparison to the original dataset, the balanced datasets are slightly more
accurate. The above-mentioned models are implemented using 10-fold cross-validation.</p>
      <p>Transparency: Investigating the importance of features in the output of a predictive model
points out those sensitive features playing a role in the model. Fig. 3 presents
the SHAP values produced by an RF algorithm on three datasets: original, balanced classes
using SMOTE, and augmented data using our method. Considering these plots, we witness
a depolarised distribution of data point values for the sensitive features in the augmented
dataset, in contrast to the other datasets.
Table 1: F1 scores (mean ± standard deviation) on the different dataset representations.
Dataset representations Random Forest XGBOOST
Original data set 0.666 ± 0.056 0.680 ± 0.059
Original data set with balanced classes 0.842 ± 0.038 0.817 ± 0.046
Original data set with balanced protected feature (gender) 0.700 ± 0.071 0.686 ± 0.067
Original data set with balanced protected feature (ethnicity) 0.783 ± 0.049 0.777 ± 0.055
Augmented dataset 0.645 ± 0.043 0.667 ± 0.015</p>
    </sec>
    <sec id="sec-8">
      <p>
        Classification complexity measurements: In order to gain insight into the dataset
distribution and to measure how difficult a classification task would be to implement
on a given dataset, we calculated some of the common metrics, comparing the original
dataset before and after the augmentation process. The C2 measure, a well-known index
computed for measuring class balance, demonstrates an improvement of 60% in balancing
the classes using our method. The Clustering Coefficient and the density of the networks before and after
augmentation do not tangibly change. We use the Collective Feature Efficiency (F4) measure to
get an overview of how the features work together [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ]; lower values of F4 indicate that it
is possible to discriminate more examples and, therefore, that the problem is simpler. Our
measurements show that F4 increased by 34%, which means the classification problem is more
difficult in the augmented data due to the debiasing process, as we expected. Another crucial
metric to consider is the Dimensionality Measures, which are indicative of data sparsity.
Measuring T4 shows that the sparsity of the augmented dataset stays the same as the original
dataset, with negligible changes.
Fairness: In order to show that our augmentation method improves fairness while creating
synthetic data points, we measured Equal Opportunity and Equal Mis-Opportunity, as
mentioned in the Supplementary. Table 2 reports the measured metrics on the original
data and after augmentation, in which the bias has clearly improved in the extended version.
      </p>
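      <p>Equal Opportunity and Equal Mis-Opportunity, measured as the gaps in True and False Positive rates between the privileged and unprivileged groups, can be computed as in the following sketch; the function names and the toy labels are illustrative assumptions:</p>
      <p>
```python
def group_rates(y_true, y_pred, mask):
    """True/False Positive rates restricted to the group selected by mask."""
    tp = fn = fp = tn = 0
    for t, p, g in zip(y_true, y_pred, mask):
        if not g:
            continue
        if t == 1 and p == 1: tp += 1
        elif t == 1: fn += 1
        elif p == 1: fp += 1
        else: tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def fairness_gaps(y_true, y_pred, privileged):
    """Equal Opportunity = TPR gap and Equal Mis-Opportunity = FPR gap
    between the privileged and unprivileged groups; smaller is fairer."""
    tpr_p, fpr_p = group_rates(y_true, y_pred, privileged)
    unprivileged = [not g for g in privileged]
    tpr_u, fpr_u = group_rates(y_true, y_pred, unprivileged)
    return abs(tpr_p - tpr_u), abs(fpr_p - fpr_u)

# toy binary wage-level predictions for six individuals
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
priv   = [True, True, True, False, False, False]
eo_gap, emo_gap = fairness_gaps(y_true, y_pred, priv)
```
Re-evaluating these two gaps after every augmentation round is what allows the assurance process to halt or continue the generation of synthetic points.</p>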
    </sec>
    <sec id="sec-9">
      <title>7. Discussion and Conclusion</title>
      <p>In this paper, we focus on the implementation of an innovative assurance procedure for fair data
augmentation. To achieve this goal, we present a comprehensive methodology and infrastructure
that provides an environment for the seamless implementation of our method. Throughout this
process, we continuously evaluate certification criteria, such as the accuracy of ML models or
fairness metrics (see Section 5 for experimental setup details). Our methodology is based on
computing the similarity, quantified and tuned by a kernel function, to construct a similarity
network of instances from a given dataset. The generation of synthetic nodes, each with features
randomly drawn from the original feature distribution, is done by creating links to more similar
nodes in the initially constructed network. Building on our previous work on the AVPRA
algorithm for semi-supervised learning, we assign labels to the unlabeled synthetic nodes after
several iterations of the AVPRA method. This involves aggregating and propagating labels
across links based on their weights. We provide a detailed discussion of the proposed approach,
outlining all the necessary steps, in Section 4. After establishing the workflow (see Section 5),
we perform a comprehensive evaluation of several metrics on different data representations
and report the results in Section 6. Our results confirm the effectiveness of our approach in
augmenting datasets while ensuring fairness metrics. We evaluate the accuracy of predicting
employee pay rates in several scenarios, including the original dataset, an original dataset with
balanced classes, an original dataset with balanced sensitive characteristics (Sex and ethnicity),
and an augmented dataset using our methodology. In particular, we observe an increase in
accuracy in the balanced dataset, both in terms of classes and subgroups. However, for the
augmented dataset, the accuracy remains stable compared to the original data, possibly due to
the debiasing process that decentralizes the focus of the classification tasks. Taking transparency
into account, we visualize the crucial, potentially sensitive features by using the SHAP values,
as shown in Figure 3. The figures show biases in the output of wage rate predictions with
respect to Sex and Age features. Even with a balanced data set, the sensitivity of the features
remains. However, in the augmented data using our methodology, the biases in the feature
values are significantly improved. In addition, we examine measures of dataset complexity
in the augmented dataset and observe improvements towards a more fair dataset in factors
such as class balance and collective feature efficiency. Stability is observed in other factors,
including the Clustering Coefficient, Density, and T4. In the final stages of our evaluation process, we
periodically measure fairness levels after each round of synthetic data point generation. This
is done by calculating metrics such as Equal Opportunity and Equal Mis-Opportunity, which
correspond to True Positive and False Positive rates, respectively. We made a binary classification
for two wage levels, considering the privileged and non-privileged groups. We considered
Gender and Age as two discriminating features, defining the privileged group as Sex: male
and Age ≤ 40, while considering the remaining instances as an unprivileged group. Table 2
shows that the fairness metrics have improved in the extended dataset (by about 50%).</p>
    </sec>
    <sec id="sec-10">
      <title>8. Conclusion</title>
<p>Implementing accurate ML models requires a sufficient amount of data. Data scarcity may
force a data scientist to adopt augmentation techniques for enlarging the original dataset.
Given the possible biases in the original dataset, for instance those introduced by the data
collector, the augmentation process may amplify the bias in the extended dataset. To address
this issue, we have proposed a methodology based on a similarity network representation of the
dataset. Links in the similarity network, composed of original and synthetic nodes, are adjusted
using exponential kernel functions that smoothly map the similarity values of nodes to the
weights of links. Our previously proposed algorithm, AVPRA, is used for labelling the newly
generated synthetic nodes in a semi-supervised fashion. The main objectives of this work,
reducing the dependency of ML models on sensitive features while improving fairness, are
achieved and continuously evaluated by different metrics in a big data ecosystem.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Lytras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , E. Damiani,
          <article-title>Big data and data analytics research: From metaphors to value space for collective wisdom in human decision making and smart machines</article-title>
          ,
          <source>International Journal on Semantic Web and Information Systems (IJSWIS) 13</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cowls</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beltrametti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chazerand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dignum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luetge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Madelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Pagallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rossi</surname>
          </string-name>
          , et al.,
          <article-title>An ethical framework for a good ai society: Opportunities, risks, principles, and recommendations</article-title>
          ,
          <source>Ethics, Governance, and Policies in Artificial Intelligence</source>
          (
          <year>2021</year>
          )
          <fpage>19</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doneda</surname>
          </string-name>
          ,
          <article-title>On the development of ai governance frameworks</article-title>
          ,
          <source>IEEE Internet Computing</source>
          <volume>27</volume>
          (
          <year>2023</year>
          )
          <fpage>70</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Riccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jahangirova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stocco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Humbatova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tonella</surname>
          </string-name>
          ,
          <article-title>Testing machine learning based systems: a systematic mapping</article-title>
          ,
          <source>Empirical Software Engineering</source>
          <volume>25</volume>
          (
          <year>2020</year>
          )
          <fpage>5193</fpage>
          -
          <lpage>5254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anisetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ardagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bena</surname>
          </string-name>
          , E. Damiani,
          <article-title>Rethinking certification for trustworthy machine learning-based applications</article-title>
          ,
          <source>IEEE Internet Computing</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shokri</surname>
          </string-name>
          ,
          <article-title>On the privacy risks of algorithmic fairness</article-title>
          ,
          <source>in: 2021 IEEE European Symposium on Security and Privacy (EuroS&amp;P)</source>
          , IEEE
          ,
          <year>2021</year>
          , pp.
          <fpage>292</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Rastegarpanah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crovella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Gummadi</surname>
          </string-name>
          ,
          <article-title>Fair inputs and fair outputs: The incompatibility of fairness in privacy and accuracy</article-title>
          ,
          <source>in: Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Gudivada</surname>
          </string-name>
          ,
          <article-title>A case study of the augmentation and evaluation of training data for deep learning</article-title>
          ,
          <source>Journal of Data and Information Quality (JDIQ) 11</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Maharana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nemade</surname>
          </string-name>
          ,
          <article-title>A review: Data pre-processing and data augmentation techniques</article-title>
          ,
          <source>Global Transitions Proceedings</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          , M. Mitchell,
          <article-title>50 years of test (un) fairness: Lessons for machine learning</article-title>
          ,
          <source>in: Proceedings of the conference on fairness, accountability, and transparency</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <source>Fairness and Machine Learning: Limitations and Opportunities</source>
          , MIT Press,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kamiran</surname>
          </string-name>
          , T. Calders,
          <article-title>Classifying without discriminating</article-title>
          , in:
          <year>2009</year>
          <article-title>2nd international conference on computer, control and communication</article-title>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Roselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matthews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Talagala</surname>
          </string-name>
          ,
          <article-title>Managing bias in ai</article-title>
          ,
          <source>in: Companion Proceedings of The 2019 World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maghool</surname>
          </string-name>
          , E. Casiraghi,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <article-title>Enhancing fairness and accuracy in machine learning through similarity networks</article-title>
          ,
          <source>in: International Conference on Cooperative Information Systems</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ghirimoldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maghool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <article-title>Validating vector-label propagation for graph embedding</article-title>
          ,
          <source>in: International Conference on Cooperative Information Systems</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anisetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ardagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Berto</surname>
          </string-name>
          ,
          <article-title>An assurance process for Big Data trustworthiness</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>146</volume>
          (
          <year>2023</year>
          )
          <fpage>34</fpage>
          -
          <lpage>46</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0167739X23001371. doi:10.1016/j.future.2023.04.003
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bacao</surname>
          </string-name>
          ,
          <article-title>Tabular and latent space synthetic data generation: a literature review</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>115</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Douzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bacao</surname>
          </string-name>
          ,
          <article-title>Geometric smote a geometrically enhanced drop-in replacement for smote</article-title>
          ,
          <source>Information sciences 501</source>
          (
          <year>2019</year>
          )
          <fpage>118</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Menardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Torelli</surname>
          </string-name>
          ,
          <article-title>Training and assessing classification rules with imbalanced data</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>28</volume>
          (
          <year>2014</year>
          )
          <fpage>92</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <article-title>Learning from imbalanced data: open challenges and future directions</article-title>
          ,
          <source>Progress in Artificial Intelligence</source>
          <volume>5</volume>
          (
          <year>2016</year>
          )
          <fpage>221</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.-T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <article-title>Minority oversampling in kernel adaptive subspaces for class imbalanced datasets</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>950</fpage>
          -
          <lpage>962</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Douzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bacao</surname>
          </string-name>
          ,
          <article-title>G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>183</volume>
          (
          <year>2021</year>
          )
          <fpage>115230</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Macal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>North</surname>
          </string-name>
          ,
          <article-title>Tutorial on agent-based modeling and simulation</article-title>
          ,
          <source>in: Proceedings of the Winter Simulation Conference</source>
          , IEEE,
          <year>2005</year>
          ,
          <fpage>14</fpage>
          pp.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Cocucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pulido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Aparicio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruíz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Simoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosa</surname>
          </string-name>
          ,
          <article-title>Inference in epidemiological agent-based models using ensemble-based data assimilation</article-title>
          ,
          <source>PLOS ONE</source>
          <volume>17</volume>
          (
          <year>2022</year>
          )
          <fpage>e0264892</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Hooten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Hanks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lowry</surname>
          </string-name>
          ,
          <article-title>Agent-based inference for animal movement and selection</article-title>
          ,
          <source>Journal of Agricultural, Biological and Environmental Statistics</source>
          <volume>15</volume>
          (
          <year>2010</year>
          )
          <fpage>523</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.-o.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Bae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Paik</surname>
          </string-name>
          ,
          <article-title>Self-evolving agent-based simulation platform for predictive analysis on socio-economics by using incremental machine learning</article-title>
          ,
          <source>in: Proceedings of the 2018 Winter Simulation Conference</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4254</fpage>
          -
          <lpage>4254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lamperti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roventini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sani</surname>
          </string-name>
          ,
          <article-title>Agent-based model calibration using machine learning surrogates</article-title>
          ,
          <source>Journal of Economic Dynamics and Control</source>
          <volume>90</volume>
          (
          <year>2018</year>
          )
          <fpage>366</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cirillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gallegati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Vagliasindi</surname>
          </string-name>
          ,
          <article-title>Validating and calibrating agent-based models: a case study</article-title>
          ,
          <source>Computational Economics</source>
          <volume>30</volume>
          (
          <year>2007</year>
          )
          <fpage>245</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Riddle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Macal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Conzelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Combs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fields</surname>
          </string-name>
          ,
          <article-title>Global critical materials markets: An agent-based modeling approach</article-title>
          ,
          <source>Resources Policy</source>
          <volume>45</volume>
          (
          <year>2015</year>
          )
          <fpage>307</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sauber-Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <article-title>The use of generative adversarial networks to alleviate class imbalance in tabular data: a survey</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>98</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative adversarial nets</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>27</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lemoine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>Mitigating unwanted biases with adversarial learning</article-title>
          ,
          <source>in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bradshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kollanyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bolsover</surname>
          </string-name>
          ,
          <article-title>Junk news and bots during the French presidential election: What are French voters sharing over Twitter in round two?</article-title>
          ,
          <source>COMPROP Data Memo 2017.4</source>
          , 4 May
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Naeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>An exploration of how fake news is taking over social media and putting public health at risk</article-title>
          ,
          <source>Health Information &amp; Libraries Journal</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>143</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maghool</surname>
          </string-name>
          ,
          <article-title>Agent-based vector-label propagation for explaining social network structures</article-title>
          ,
          <source>in: International Conference on Knowledge Management in Organizations</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>306</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kamiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Calders</surname>
          </string-name>
          ,
          <article-title>Data preprocessing techniques for classification without discrimination</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>33</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>M.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Friedler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moeller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scheidegger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Venkatasubramanian</surname>
          </string-name>
          ,
          <article-title>Certifying and removing disparate impact</article-title>
          ,
          <source>in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Sang,
          <article-title>Towards accuracy-fairness paradox: Adversarial example-based data augmentation for visual debiasing</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4346</fpage>
          -
          <lpage>4354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kamiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mansha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Exploiting reject option in classification for social discrimination control</article-title>
          ,
          <source>Information Sciences</source>
          <volume>425</volume>
          (
          <year>2018</year>
          )
          <fpage>18</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>On fairness and calibration</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Srebro</surname>
          </string-name>
          ,
          <article-title>Equality of opportunity in supervised learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Lorena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Souto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>How complex is your classification problem? a survey on measuring classification complexity</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>52</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anisetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Berto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <article-title>A DevSecOps-based Assurance Process for Big Data Analytics</article-title>
          , in:
          <source>2022 IEEE International Conference on Web Services (ICWS)</source>
          , IEEE, Barcelona, Spain,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . URL: https://ieeexplore.ieee.org/document/9885738/. doi:10.1109/ICWS55610.2022.00017.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Gower</surname>
          </string-name>
          ,
          <article-title>Some distance properties of latent root and vector methods used in multivariate analysis</article-title>
          ,
          <source>Biometrika</source>
          <volume>53</volume>
          (
          <year>1966</year>
          )
          <fpage>325</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Mezlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Demir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fiume</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brudno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haibe-Kains</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldenberg</surname>
          </string-name>
          ,
          <article-title>Similarity network fusion for aggregating data types on a genomic scale</article-title>
          ,
          <source>Nature Methods</source>
          <volume>11</volume>
          (
          <year>2014</year>
          )
          <fpage>333</fpage>
          -
          <lpage>337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>A.</given-names>
            <surname>Orriols-Puig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Macia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>Documentation for the data complexity library in C++</article-title>
          , Universitat Ramon Llull,
          <source>La Salle</source>
          <volume>196</volume>
          (
          <year>2010</year>
          )
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>