<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Augmentation for Data-Centric AI Through the Lens of Semantic Technologies: A Position Paper</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Bertillo</string-name>
          <email>daniele.bertillo@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cabibbo</string-name>
          <email>luca.cabibbo@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Cima</string-name>
          <email>cima@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valter Crescenzi</string-name>
          <email>valter.crescenzi@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Console</string-name>
          <email>console@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Maria Delfino</string-name>
          <email>delfino@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Iannucci</string-name>
          <email>stefano.ianucci@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Lembo</string-name>
          <email>lembo@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Lenzerini</string-name>
          <email>lenzerini@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Marconi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Merialdo</string-name>
          <email>paolo.merialdo@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Napoleone</string-name>
          <email>mar.napoleone3@stud.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Papi</string-name>
          <email>papi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonella Poggi</string-name>
          <email>poggi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Maria Scafoglieri</string-name>
          <email>scafoglieri@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
          <email>riccardo.torlone@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Roma Tre University</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Data augmentation is a fundamental technique in machine learning to enhance model generalization by artificially expanding training datasets. However, conventional augmentation approaches often rely on heuristic transformations that may not fully capture domain-specific knowledge. This position paper advocates a data-centric AI perspective on data augmentation, emphasizing the integration of semantic technologies, particularly domain ontologies, to guide augmentation strategies. The use of techniques from Symbolic AI for data augmentation has been dealt with only in a few recent papers. Our goal is to explore further this idea, based on the consideration that an explicit representation of the domain may be helpful in two key tasks: optimizing the generation of new data, and validating the generated data, both fundamental steps for all data augmentation strategies. We aim at developing novel approaches that combine ontologies and data augmentation techniques to address these two tasks, in particular by relying on automated reasoning. We argue that leveraging knowledge representation and symbolic reasoning enables more principled and context-aware data augmentation, leading to improved model robustness and fairness.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Augmentation</kwd>
        <kwd>Ontology Based Data Access</kwd>
        <kwd>Semantic Technologies</kwd>
        <kwd>Data-centric Artificial Intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, machine learning (ML) has emerged as a transformative technology across a wide
range of domains, from healthcare and finance to autonomous systems and natural language
processing. A critical factor in the success of ML models is their ability to generalize well to
unseen data, which is heavily influenced by the quality and diversity of the training datasets. In
this framework, data augmentation (DA) has become a fundamental technique to enhance model
generalization by artificially expanding training datasets through data transformations [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
These transformations aim to introduce variability into the data, thereby enabling models to learn
more robust and invariant representations. Numerous studies have explored individual methods
as well as combinations of multiple DA operations. Also, tools and libraries exist that implement
diverse DA methods. However, current approaches frequently lack systematic methodologies
for evaluating and optimizing DA strategies, and crafting an effective DA strategy remains
a challenging and time-consuming process, often requiring domain-specific expertise [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
Moreover, conventional DA approaches often rely on heuristic transformations, such as random
rotations, translations, or noise injection, which may not fully capture the underlying
domain-specific knowledge.
      </p>
      <p>
        This position paper advocates for a novel data-centric AI perspective on data augmentation,
emphasizing the integration of semantic technologies, particularly domain ontologies, to guide
augmentation strategies. While research on data augmentation is far from new, the coupling
with semantic technologies remains largely unexplored. Although some studies employ
ontologies in the data augmentation process [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], they predominantly utilize them as vocabulary
resources [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In contrast, our goal is to investigate the idea that an explicit representation of
the domain can play a first-class role in two key tasks that are fundamental to the success
of any data augmentation strategy: optimizing the generation of new data and validating the
generated data.
      </p>
      <p>
        The integration of domain ontologies into data augmentation processes offers several
advantages. First, ontologies provide a structured and formal representation of domain knowledge,
enabling the generation of data that is semantically consistent with the underlying domain.
Second, ontologies can facilitate the validation of augmented data by providing a framework for
automated reasoning, ensuring that the generated data adheres to domain-specific constraints
and rules. Moreover, the use of symbolic reasoning in conjunction with data augmentation opens
up new possibilities for addressing challenges related to model interpretability and fairness [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
By explicitly encoding domain knowledge and constraints, we can ensure that the augmented
data reflects the desired properties and avoids biases that may be inadvertently introduced
through heuristic transformations. This is particularly relevant in applications where fairness
and transparency are critical, such as in healthcare or criminal justice.
      </p>
      <p>
        This work is part of a wider project, called ENDURANCE, which builds upon seminal
work in data provenance [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] and explainable AI (xAI) [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ], aiming to provide a more
holistic and interpretable approach to DA engineering.
      </p>
      <p>The rest of the paper is organized as follows. In Section 2 we review the main DA methods
proposed in the literature. In Section 3, we introduce ontologies and describe the role they play
in data management. Then, in Section 4, we outline the approaches we aim to explore in order
to exploit ontologies for DA. Finally, in Section 5, we conclude the paper by discussing future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Augmentation</title>
      <p>
        DA methods can be broadly categorized into three main approaches, though it is important to
acknowledge that many techniques draw inspiration from multiple domains, often blending
ideas across these categories [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
      </p>
      <p>Heuristic or rule-based augmentation methods rely on manually defined transformations,
making them relatively simple to implement and computationally inexpensive. However,
their effectiveness often depends on the expertise of domain specialists who tailor the
transformations to the specific dataset. While this approach can yield useful variations of
data, it may struggle to introduce the level of diversity needed to significantly enhance model
performance.</p>
      <p>
        A more dynamic class of techniques is mixing-based augmentation, which generates new
training samples by interpolating between two or more existing examples. Originally designed
for image processing, these methods have since been adapted for NLP, time series, and tabular
data. However, the discrete nature of language tokens and structured data poses challenges,
as direct linear interpolation is often not feasible. To circumvent this issue, mixing-based
approaches typically operate within a continuous embedding space [
        <xref ref-type="bibr" rid="ref13 ref2 ref3">2, 3, 13</xref>
        ].
      </p>
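      <p>The interpolation step behind mixing-based augmentation can be sketched as follows. This is a minimal illustration, not the implementation of any cited method: it assumes that each example has already been mapped to a continuous embedding vector and a one-hot label, and it draws the mixing coefficient from a Beta distribution as in the original Mixup formulation. The function names and the toy vectors are hypothetical.</p>
      <preformat>
```python
import random


def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mixup(x_a, x_b, y_a, y_b, lam):
    """Convex combination of two embeddings and of their one-hot labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Hypothetical embeddings and one-hot labels for two training examples.
emb_dog, emb_cat = [1.0, 0.0, 2.0], [0.0, 1.0, 0.0]
lab_dog, lab_cat = [1.0, 0.0], [0.0, 1.0]

rng = random.Random(0)
lam = sample_lambda(0.4, rng)
x_mix, y_mix = mixup(emb_dog, emb_cat, lab_dog, lab_cat, lam)
```
      </preformat>
      <p>Because the interpolation acts on vectors rather than on tokens or categorical values, the mixed sample need not correspond to any valid discrete input, which is why such methods are applied in an embedding space.</p>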
      <p>
        Recent methods involve deep models explicitly trained to generate new data. Techniques
such as variational autoencoders (VAEs) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], generative adversarial networks (GANs) [15],
diffusion models, and large language models fall into this category. These approaches have
the potential to create highly diverse and even novel examples that extend beyond the original
dataset’s distribution. However, if not carefully constrained, these models can generate
unrealistic or misleading samples that degrade rather than enhance learning. To mitigate this issue,
recent advances have explored leveraging large pre-trained models to generate data in a more
context-aware manner, ensuring that the augmented data remains relevant and meaningful
[
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>Each of these approaches offers distinct advantages and trade-offs. While rule-based methods
provide control and interpretability, they may lack the richness of data needed for more complex
learning tasks. Mixing-based methods offer an efficient way to increase sample diversity but
require careful adaptation for discrete data types. Deep generative models, on the other hand,
represent the cutting edge of DA, capable of synthesizing high-quality data, though they require
significant computational resources and robust safeguards to maintain reliability. As research
in this area continues to evolve, the interplay between these approaches is likely to shape the
next generation of DA strategies.</p>
      <sec id="sec-2-1">
        <title>2.1. Image Data Augmentation</title>
        <p>DA in computer vision (CV) aims to enrich image datasets with diverse, label-consistent
variations, exposing models to previously unseen visual patterns.</p>
        <p>
          Rule-based Transformations. Simple geometric or photometric operations (e.g., random
flipping, cropping, rotation, color jittering) preserve semantic labels yet diversify the input
distribution. These lightweight edits are ubiquitous in CNN training pipelines, offering consistent
improvements with minimal overhead [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          Noise and Erasing. Injecting noise into the image matrix (e.g., Gaussian noise injection [16])
or masking random regions, similar to dropout regularization (e.g., Cutout [17]), encourages
models to develop more robust feature representations. Variants of these techniques incorporate
content-aware erasing, allowing for more controlled manipulation of hidden information.
        </p>
        <p>
          Mixing Strategies. Inspired by Mixup [18], several methods overlay or interpolate pairs of
images. This smooths the decision boundary by forcing the model to learn from “mixed” samples
(e.g., partial dog, partial cat). Such mixing may reduce overfitting and improve adversarial
robustness. Variants like CutMix [19] combine random patches from different images, effectively
blending object context while encouraging the network to leverage global cues.
        </p>
        <p>
          Deep Generative Models. These methods utilize powerful generative models, such as GANs
and diffusion-based approaches, to create realistic synthetic images while preserving essential
features like class labels [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Conditional GANs have been employed to generate images tailored
to underrepresented classes. Similarly, diffusion models have gained traction for generating
diverse yet coherent augmentations by sampling from learned image distributions. These models
progressively refine images through a series of de-noising steps, ensuring realistic variations
while maintaining consistency with the original dataset.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. NLP Data Augmentation</title>
        <p>DA for natural language focuses on modifying text corpora by altering syntax while maintaining
semantic integrity.</p>
        <p>Rule-based Transformations. Simple lexical edits like synonym substitution or random
swaps expand a dataset with minimal overhead. For example, EDA [20] replaces or deletes
words to yield new training samples while preserving semantics—often enough to boost text
classification performance under low-resource conditions.</p>
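        <p>The kind of lexical edits described above can be sketched as follows. This is a minimal illustration in the spirit of EDA, not its reference implementation: the function names, the toy synonym table, and the example sentence are hypothetical, and a real pipeline would draw synonyms from a lexical resource.</p>
        <preformat>
```python
import random


def synonym_replacement(tokens, synonyms, rng):
    """Replace every token that has an entry in the synonym table."""
    return [rng.choice(synonyms[t]) if t in synonyms else t for t in tokens]


def random_swap(tokens, rng):
    """Swap two randomly chosen positions (no-op with fewer than two tokens)."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p; assumes a non-empty sentence
    and always keeps at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical synonym table and input sentence.
SYNONYMS = {"great": ["excellent", "wonderful"]}
sentence = "the movie was great".split()

rng = random.Random(0)
augmented = [
    synonym_replacement(sentence, SYNONYMS, rng),
    random_swap(sentence, rng),
    random_deletion(sentence, 0.2, rng),
]
```
        </preformat>
        <p>Each variant inherits the label of the original sentence, which is only safe when the edit does not change its meaning; random swaps and deletions give no such guarantee in general.</p>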
        <p>
          Embedding-based Mixup. Adapted from computer vision’s Mixup [18], methods like MixText
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] generate “soft” data points by blending sentence embeddings. Unlike direct text-based
augmentations, this approach operates in the embedding space, where interpolated vectors may not
always correspond to syntactically valid sentences. Despite potential syntactic inconsistencies,
these synthetic samples can preserve meaningful semantic properties, contributing to improved
generalization in text classification tasks. This label-preserving technique smooths decision
boundaries by exposing the model to diverse latent-space representations.
        </p>
        <p>Backtranslation. A popular sequence-to-sequence approach, backtranslation [21] passes
text through two or more translation steps (e.g., English→Italian→English). The resulting
paraphrases have altered surface forms yet retain original meaning, yielding label-consistent
diversity valuable in low-resource or domain-shift scenarios.</p>
        <p>Language Models. Generative architectures such as GPT [22] create synthetic text samples
by fine-tuning on specific labels or topics, expanding the linguistic diversity of training data.
Similarly, sequence-to-sequence models like T5 [23] facilitate augmentation through tasks such
as paraphrasing and controlled text transformations. Meanwhile, masked language models
(e.g., BERT [24]) contribute to DA by replacing tokens with contextually plausible alternatives,
improving coverage of rare patterns and linguistic styles. More recently, large language models
(LLMs) such as GPT-3.5 and LLaMA have demonstrated advanced capabilities in generating
fluent and coherent text with minimal prompting, making them valuable tools for NLP DA.
Their ability to generate highly context-aware transformations introduces new possibilities for
enriching training corpora, though challenges such as computational costs and potential biases
in generated content must be carefully managed.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Tabular Data Augmentation</title>
        <p>
          Tabular data are widely used across various domains, typically found as web tables, databases,
or spreadsheets, containing both numerical and categorical features. Due to the discrete and
heterogeneous nature of tabular data, any augmentation procedure must adhere to
domain-specific constraints, such as column types, positivity constraints, and permissible value ranges
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Some research has proposed end-to-end pipelines for tabular DA (TDA), incorporating
multiple stages to ensure high-quality augmentation [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Broadly, TDA can be divided into two
main steps: pre-augmentation and augmentation. The pre-augmentation stage includes
essential table-processing tasks such as schema alignment, entity resolution, data fusion, and
data cleaning to address missing or incorrect values [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The augmentation step can be further
categorized into retrieval-based methods, which enrich table data by integrating relevant
information from external but available sources, and generative methods, which synthesize
new data by learning the statistical properties of the original dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Here, we focus on
generative augmentation approaches.
        </p>
        <p>
          Row-level Augmentation (Record Generation): New rows can be generated using models
that sample from learned data distributions, such as SMOTE [25], Gaussian Mixture Models, or
GANs. These methods help mitigate class imbalance or data scarcity by generating additional
training examples that preserve the statistical properties of the original data [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Column-level Augmentation (Feature Construction): When additional features are needed,
transformation-based feature engineering can be applied to existing attributes, or generative
models can be used to create entirely new columns. A key challenge in this process is ensuring
that the added features contribute meaningful information rather than introducing noise [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Cell-level Augmentation (Cell Imputation): Missing or incorrect cell entries can degrade
model performance. Generative methods such as GANs, diffusion models, or large pre-trained
models can be used to impute missing values by leveraging learned distributions. Classical
imputation techniques, such as regression-based methods or MICE (Multiple Imputation by
Chained Equations), also remain effective for filling incomplete data [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
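        <p>As a minimal sketch of row-level augmentation under domain constraints, the following code interpolates between two rows of the same class, in the style of SMOTE, and then checks the result against permissible value ranges. It is illustrative only: nearest-neighbour selection is omitted, categorical columns are simply copied from one of the two parents, and the column names and ranges are hypothetical.</p>
        <preformat>
```python
import random


def smote_like_row(row_a, row_b, rng):
    """Interpolate numeric columns between two same-class rows;
    copy categorical columns from one of the two parents."""
    gap = rng.random()
    new_row = {}
    for col, a in row_a.items():
        b = row_b[col]
        if isinstance(a, (int, float)) and not isinstance(a, bool):
            value = a + gap * (b - a)
            # Keep integer columns integer-typed.
            new_row[col] = round(value) if isinstance(a, int) else value
        else:
            new_row[col] = rng.choice([a, b])
    return new_row


def satisfies_constraints(row, ranges):
    """Check permissible value ranges given as {column: (low, high)}."""
    return all(row[col] >= low and high >= row[col]
               for col, (low, high) in ranges.items())


# Hypothetical minority-class rows and domain constraints.
row_a = {"age": 30, "income": 1000.0, "city": "Rome"}
row_b = {"age": 40, "income": 2000.0, "city": "Milan"}
RANGES = {"age": (0, 120), "income": (0.0, float("inf"))}

rng = random.Random(0)
candidate = smote_like_row(row_a, row_b, rng)
accepted = satisfies_constraints(candidate, RANGES)
```
        </preformat>
        <p>Range checks of this kind cover only constraints on single columns; constraints that relate several attributes or several rows call for a richer formalism.</p>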
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Ontologies</title>
      <p>In Computer Science, "ontology" is a specific term denoting an artifact that is designed for
enabling the modeling of knowledge about some domain. In Artificial Intelligence, an ontology
is a symbolic representation of a domain of interest, a formal tool used to capture and reason over
an explicit specification of the domain knowledge [26].</p>
      <p>There is actually a long history of this notion. The early "semantic networks" [27] and
"frame systems" [28] proposed and studied in the ’70s are the first examples of ontology
formalisms developed with the rise of symbolic artificial intelligence (AI) and structured knowledge
representation languages.</p>
      <p>In the ’80s, the research on object-oriented languages and conceptual data modeling borrowed
several principles from AI and adapted them towards the goal of devising new methodologies
for software development and database design. The common aspect of these formalisms is
the set of representational primitives with which to model a domain, namely classes (or
concepts), attributes (or properties), and relationships (or relations among class members).</p>
      <p>In the ’90s, the research on Description Logics [29] provided solid logical foundations of
ontologies, continuing the work started by Brachman and Levesque in 1986 [30], and carrying
out a large body of research with the goal of understanding how the expressiveness of the
class definition mechanisms used in a given Description Logic influences the feasibility and the
effectiveness of automated reasoning algorithms. After more than three decades, we now have
a global picture of the expressiveness/complexity trade-off in Description Logics.</p>
      <p>In recent years, ontologies have been advocated as tools to provide a semantic layer for data
integration, knowledge graphs, and machine learning, with the goal of improving the effectiveness, quality,
and explainability of several AI tasks.</p>
      <p>Ontology-Based Data Management (OBDM) [31] is a well-founded paradigm aiming at
enriching data with semantics, so as to rely on ontology reasoning for data management. Specifically,
an OBDM system is a three-layered architecture comprising an ontology, a set of autonomous
data sources and a so-called mapping establishing the relationship between the two. Given that
within this setting data are typically very large, the literature on OBDM has mainly focused on the
study of forms of reasoning that are tractable with respect to the size of the data.</p>
      <p>While traditional reasoning over ontologies focused on inferring schema-level properties,
OBDM introduces the issue of using the conceptual level to access and manipulate data. For this
reason, the most studied form of reasoning in this setting is query answering, i.e. the problem
of logically deriving answers to queries, based on the knowledge expressed in the ontology and
the mappings. Notably, query answering is the basis for other relevant tasks in OBDM, such as
data quality checking [32, 33] and privacy preserving query processing.</p>
      <p>In general, OBDM resorts to symbolic techniques to infer (new) insights from data, as opposed
to ML, that induces insights from data through sub-symbolic techniques.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Approach</title>
      <p>As discussed in Section 2, current DA techniques can be grouped into two broad classes:
rule-based or sub-symbolic, i.e., based on sub-symbolic AI approaches. As each approach shows
strengths that could help mitigate the other’s weaknesses, we argue that a more robust DA
framework should combine techniques from both approaches. Specifically, taking into
account the semantics of the application domain defined by ontological axioms may improve two
fundamental steps performed via purely sub-symbolic DA techniques: optimizing the generation,
and strengthening the validation of synthetic data.</p>
      <p>In what follows, we propose several novel approaches to combine ontologies and sub-symbolic
DA techniques to address these two tasks. A crucial assumption for all these techniques is that
ontological axioms and training data are compatible, i.e., we can perform standard reasoning
tasks over the specification defined by the data and the axioms. For this purpose, we envision
the use of specifically crafted mappings, able to reconcile ML models and domain knowledge in
the ontology.</p>
      <p>Additionally, we assume that ontological axioms are always able to define correct and
meaningful information on the domain of interest. Due to the specific use cases we are considering,
this may require the use of expressive ontological languages that can handle probabilistic rules.</p>
      <p>Generation. The first task we aim to tackle is the optimization of the data generation process. In
this context, we discuss two different approaches in which an ontology is used to influence the
data generation process. For this purpose, let the generative process be denoted by a function
G(·), called generator, which outputs new unseen data samples.</p>
      <p>• Ontology-guided generation: The ontology may be used to process the output of G(·) and
improve it according to its axioms. In particular, one could use an ontology to verify
the consistency of the generated samples against the domain knowledge possessed by
experts and fix possible inconsistencies. A similar approach is proposed by [34], where
logical constraints are incorporated in the output layer of a neural network. However,
such constraints are expressed purely in terms of formulae over the attributes of the
data samples. In the approach we propose, on the contrary, we want to formalize these
constraints using a higher level of abstraction, i.e., using the symbols defined by an
ontology. This will make it possible to reason over the domain, possibly inferring new
knowledge concerning the generated sample, and therefore further augmenting the data.</p>
      <p>• Ontology as the generator: Another possible approach is to use the ontology as a set of
rules that define the behavior of the generator G(·) itself. In this context, the ontology
could be used to generate data samples that are either already consistent with the domain
knowledge or only partially consistent. In the latter case, we could also exploit the
ontology to measure the inconsistencies and eventually address them.</p>
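      <p>To make the ontology-guided strategy more concrete, the following minimal sketch post-processes the output of a generator with a toy ontology made of subclass and disjointness axioms. It is an illustration under strong assumptions, not the framework envisioned here: the axioms, class names, and sample format are hypothetical, inconsistent samples are discarded rather than repaired, and a real system would delegate reasoning to a Description Logic reasoner.</p>
      <preformat>
```python
# Hypothetical toy axioms: child class -> set of parent classes.
SUBCLASS_OF = {
    "Newborn": {"Minor"},
    "Minor": {"Person"},
    "Adult": {"Person"},
    "Driver": {"Adult"},
}
# Hypothetical disjointness axioms between classes.
DISJOINT = [("Minor", "Adult")]


def infer_classes(asserted):
    """Close a set of asserted classes under the subclass axioms."""
    inferred = set(asserted)
    frontier = list(asserted)
    while frontier:
        cls = frontier.pop()
        for parent in SUBCLASS_OF.get(cls, ()):
            if parent not in inferred:
                inferred.add(parent)
                frontier.append(parent)
    return inferred


def violated_axioms(asserted):
    """Disjointness axioms violated by the inferred classes of a sample."""
    inferred = infer_classes(asserted)
    return [(a, b) for a, b in DISJOINT if a in inferred and b in inferred]


def guide(samples):
    """Drop inconsistent samples and enrich the others with inferred classes."""
    kept = []
    for sample in samples:
        if not violated_axioms(sample["types"]):
            kept.append({**sample, "types": infer_classes(sample["types"])})
    return kept


# Hypothetical generator output: the second sample is inconsistent.
generated = [
    {"id": 1, "types": {"Newborn"}},
    {"id": 2, "types": {"Newborn", "Driver"}},
]
guided = guide(generated)
```
      </preformat>
      <p>In this toy run the first sample is kept and gains the inferred classes Minor and Person, while the second is rejected because it is inferred to be both a Minor and an Adult. The list returned by the violation check could also serve as a crude measure of inconsistency for partially consistent samples.</p>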
      <p>Validation. Performing a validation step on the output of a generative technique is a common
practice that has been shown to improve the quality of the output. Usually, given a generator
G(·), its validation involves training a binary classifier called discriminative model, or
discriminator, D(·). This discriminator is trained to distinguish between real samples, i.e., those close
enough to the original dataset, and synthetic samples, i.e., those that lack some defining
features. Combining this technique with ontologies could lead to improved validation results.</p>
      <p>• Ontology-enhanced validation: this approach employs the ontology to improve the
performance of a given discriminator D(·), i.e., to provide a more accurate classification. To
this end, we propose two alternative strategies: ontology-first and discriminator-first. The
ontology-first strategy consists in applying the ontology rules to the input of D(·), before
the classification is computed. The intuition behind this approach is that ontological
axioms could be used to infer additional knowledge on the data samples. This knowledge
could enhance the input of D(·), thus making the classification more precise. Adding
this information to the data samples, however, may require modifying the
architecture of D(·) to ensure compatibility. In contrast, the discriminator-first strategy
applies the ontological axioms to the output of D(·), verifying its consistency with the
domain knowledge, without interfering with the model architecture. For example, a
sample misclassified as real by D(·) could get the correct label synthetic after verifying
the violation of some domain rules.</p>
      <p>• Ontology as the discriminator: another novel approach we propose is the use of an ontology
as the discriminator D(·) itself. Instead of training a model to recognize synthetic data,
the ontology could be used to directly validate the output of a generator G(·), checking
consistency directly against domain knowledge.</p>
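      <p>The discriminator-first strategy and the use of an ontology as the discriminator can be sketched as follows. This is a minimal illustration under stated assumptions: domain rules are modelled as plain Python predicates standing in for ontology axioms, the discriminator is a hypothetical stub that accepts every sample, and the attribute names are invented for the example.</p>
      <preformat>
```python
def violated_rules(sample, rules):
    """Names of the domain rules that the sample breaks."""
    return [name for name, holds in rules.items() if not holds(sample)]


def discriminator_first(sample, discriminator, rules):
    """Relabel a sample accepted by the discriminator if it breaks a rule."""
    label = discriminator(sample)
    if label == "real" and violated_rules(sample, rules):
        return "synthetic"
    return label


def ontology_as_discriminator(sample, rules):
    """Use the rules alone: accept a sample only if no rule is violated."""
    return "synthetic" if violated_rules(sample, rules) else "real"


# Hypothetical domain rules, standing in for ontology axioms.
RULES = {
    "age_non_negative": lambda s: s["age"] >= 0,
    "drivers_are_adults": lambda s: not s["has_licence"] or s["age"] >= 18,
}


def naive_discriminator(sample):
    """Hypothetical stand-in for a trained model that accepts everything."""
    return "real"


sample = {"age": 7, "has_licence": True}
print(discriminator_first(sample, naive_discriminator, RULES))  # synthetic
```
      </preformat>
      <p>The wrapper leaves the discriminator untouched, which reflects the main appeal of the discriminator-first strategy: domain knowledge corrects the output without any change to the model architecture.</p>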
      <p>Combining generation and validation in an online approach. In the previous paragraphs
we described techniques to employ ontologies in an offline fashion, i.e., given G and D. A
natural, and particularly interesting, extension of our study is to investigate online approaches
where ontologies are integrated in the training phase of generative and discriminative machine
learning models. In particular, we propose to use this approach in conjunction with GAN
models, in which ontologies could be used to guide the learning process of both G and D.</p>
      <p>Some of the most notable approaches in the literature are based on compiling the semantic
constraints into the loss function of the training algorithm [35, 36], or directly into the model
structure [37, 38]. However, none of these approaches resorts to an ontology to model the domain
knowledge, which is at the core of our proposal. We argue that leveraging domain ontologies in
the training phase of both the generative model G(·) and the discriminative model D(·) of a
GAN leads to a context-aware generation of data, and therefore to improved DA.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>This paper advocates for integrating domain ontologies into DA to address the limitations of
conventional heuristic approaches in knowledge-intensive ML applications. By formalizing
domain knowledge through semantic technologies, ontology-driven DA enhances both data
generation and validation processes.</p>
      <p>Three key findings emerge: (1) ontologies enable semantically consistent training data
expansion, overcoming the domain-agnostic nature of traditional transformations like rotations
or noise injection; (2) automated reasoning via ontological constraints ensures augmented
data validity, preventing semantic inconsistencies that could degrade model performance; and
(3) explicit knowledge representation supports fairness and interpretability in high-risk domains
like healthcare by mitigating biases inherent in purely statistical DA methods.</p>
      <p>
        The proposed framework bridges symbolic AI and data-centric paradigms, offering systematic
strategies to align synthetic data with domain-specific rules. To validate the effectiveness of
the proposed approach, the project will conduct experimental analysis across multidisciplinary
scenarios encompassing three distinct use cases:
      </p>
      <p>
        Entity Resolution: The first use case will involve entity resolution tasks on structured and
semi-structured data [
        <xref ref-type="bibr" rid="ref11">11, 39, 40</xref>
        ]. In this scenario, our objective is to evaluate how the proposed
validation approaches can improve known techniques for matching and linking records across
different data sources.
      </p>
      <p>NLP Scenario: In collaboration with the Senate of the Italian Republic, the second use case will
focus on clustering textual data (law amendments) [41], exploring how ontologies can support
generative approaches for text generation and NLP tasks.</p>
      <p>Hand Writer Identification: The third use case will address the writer identification problem
for medieval manuscripts [42]. This use case in the digital humanities will test the applicability
of DA techniques to image-based tasks. Tasks involving images may benefit from an online
approach to augmentation as we discussed in the previous sections.</p>
      <p>Future work should explore: (1) neuro-symbolic integration combining ontological reasoning
with neural data generation, (2) constraint-based augmentation using ontological rules, and
(3) automated validation pipelines verifying semantic consistency across augmentation cycles.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <sec id="sec-6-1">
        <p>This work has been supported by PNRR MUR project PE0000013-FAIR.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <p>The authors have not employed any Generative AI tools.</p>
        <p>[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020)
139–144.
[16] F. J. Moreno-Barea, F. Strazzera, J. M. Jerez, D. Urda, L. Franco, Forward noise
adjustment scheme for data augmentation, in: 2018 IEEE symposium series on computational
intelligence (SSCI), IEEE, 2018, pp. 728–734.
[17] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with
cutout, arXiv preprint arXiv:1708.04552 (2017).
[18] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk
minimization, arXiv preprint arXiv:1710.09412 (2017).
[19] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, Cutmix: Regularization strategy to train
strong classifiers with localizable features, in: Proceedings of the IEEE/CVF international
conference on computer vision, 2019, pp. 6023–6032.
[20] J. Wei, K. Zou, Eda: Easy data augmentation techniques for boosting performance on text
classification tasks, arXiv preprint arXiv:1901.11196 (2019).
[21] R. Sennrich, B. Haddow, A. Birch, Improving neural machine translation models with
monolingual data, arXiv preprint arXiv:1511.06709 (2015).
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are
unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of
machine learning research 21 (2020) 1–67.
[24] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 conference of the
North American chapter of the association for computational linguistics: human language
technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.
[25] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority
over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.
[26] N. Guarino, D. Oberle, S. Staab, What is an ontology?, Handbook on ontologies (2009)
1–17.
[27] W. A. Woods, What’s in a link: Foundations for semantic networks, in: Representation
and understanding, Elsevier, 1975, pp. 35–82.
[28] M. Minsky, A framework for representing knowledge, 1974.
[29] F. Baader, W. Nutt, Basic description logics, in: The description logic handbook: theory,
implementation, and applications, 2003, pp. 43–95.
[30] R. J. Brachman, H. J. Levesque, Readings in knowledge representation, Technical Report,
AT and T Bell Labs., 1985.
[31] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, R. Rosati, Linking data to
ontologies, in: Journal on data semantics X, Springer, 2008, pp. 133–173.
[32] C. Daraio, M. Lenzerini, C. Leporelli, P. Naggar, A. Bonaccorsi, A. Bartolucci, The
advantages of an ontology-based data management approach: openness, interoperability and
data quality, Scientometrics 108 (2016) 441–455.
[33] M. Console, M. Lenzerini, Data quality in ontology-based data access: The case of
consistency, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 28,
2014.
[34] N. Hoernle, R. M. Karampatsis, V. Belle, K. Gal, Multiplexnet: Towards fully
satisfied logical constraints in neural networks, 2021. URL: https://arxiv.org/abs/2111.01564.
arXiv:2111.01564.
[35] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. V. den Broeck, A semantic loss function for
deep learning with symbolic knowledge, 2018. URL: https://arxiv.org/abs/1711.11157.
arXiv:1711.11157.
[36] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, M. Vechev, DL2:
Training and querying neural networks with logic, in: K. Chaudhuri, R.
Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning,
volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 1931–1941. URL:
https://proceedings.mlr.press/v97/fischer19a.html.
[37] G. G. Towell, J. W. Shavlik, M. O. Noordewier, Refinement of approximate domain theories
by knowledge-based neural networks, in: Proceedings of the Eighth National Conference
on Artificial Intelligence - Volume 2, AAAI’90, AAAI Press, 1990, pp. 861–866.
[38] A. Daniele, L. Serafini, Neural networks enhancement with logical knowledge, 2021. URL:
https://arxiv.org/abs/2009.06087. arXiv:2009.06087.
[39] V. Di Cicco, D. Firmani, N. Koudas, P. Merialdo, D. Srivastava, Interpreting deep learning
models for entity resolution: an experience report using lime, in: Proceedings of the
second international workshop on exploiting artificial intelligence techniques for data
management, 2019, pp. 1–4.
[40] R. Fagin, P. G. Kolaitis, D. Lembo, L. Popa, F. Scafoglieri, A Framework for Combining
Entity Resolution and Query Answering in Knowledge Bases, in: Proceedings of the 20th
International Conference on Principles of Knowledge Representation and Reasoning, 2023,
pp. 229–239.
[41] A. Sajeva, S. Iannucci, C. Marchetti, P. Merialdo, R. Torlone, Clustering amendments with
semantic embeddings, SEBD 2024 (2024).
[42] L. Lastilla, S. Ammirati, D. Firmani, N. Komodakis, P. Merialdo, S. Scardapane,
Self-supervised learning for medieval handwriting identification: A case study from the vatican
apostolic library, Information Processing &amp; Management 59 (2022) 102875.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shorten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <article-title>A survey on image data augmentation for deep learning</article-title>
          ,
          <source>Journal of big data 6</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>A survey of data augmentation approaches for nlp</article-title>
          ,
          <source>arXiv preprint arXiv:2105.03075</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Tabular data augmentation for machine learning: Progress and prospects of embracing generative ai</article-title>
          ,
          <source>arXiv preprint arXiv:2407.21523</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Skreta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arbabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Drysdale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brudno</surname>
          </string-name>
          ,
          <article-title>Automatically disambiguating medical acronyms with ontology-aware deep learning</article-title>
          ,
          <source>Nature communications 12</source>
          (
          <year>2021</year>
          )
          <fpage>5319</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <article-title>Generating unseen diseases patient data using ontology enhanced generative adversarial networks</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>8</volume>
          (
          <year>2025</year>
          )
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>A method for synthesizing ontology-based textual design datasets: Evaluating the potential of llm in domain-specific dataset generation</article-title>
          ,
          <source>Journal of Mechanical Design</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehrabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galstyan</surname>
          </string-name>
          ,
          <article-title>A survey on bias and fairness in machine learning</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Capturing and querying fine-grained provenance of preprocessing pipelines in data science</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>507</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lauro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Supporting better insights of data science pipelines with fine-grained provenance</article-title>
          ,
          <source>ACM Transactions on Database Systems</source>
          <volume>49</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Teofili</surname>
          </string-name>
          ,
          <article-title>Explaining link prediction systems based on knowledge graph embeddings</article-title>
          ,
          <source>in: Proceedings of the 2022 international conference on management of data</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2062</fpage>
          -
          <lpage>2075</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Teofili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Martello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Effective explanations for entity resolution models</article-title>
          ,
          <source>in: 2022 IEEE 38th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>2709</fpage>
          -
          <lpage>2721</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lastilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ammirati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <article-title>How explainable is automatic hand identification? a case study</article-title>
          ,
          <source>in: UWA Conference</source>
          <year>2023</year>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification</article-title>
          ,
          <source>arXiv preprint arXiv:2004.12239</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          , et al.,
          <source>Auto-encoding variational bayes</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>