<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>S-PIC4CHU: Semantics-based Provenance, Integrity, and Curation for Consistent, High-quality, and Unbiased Data Science</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianvincenzo Alfano</string-name>
          <email>g.alfano@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilaria Bartolini</string-name>
          <email>ilaria.bartolini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Calvanese</string-name>
          <email>diego.calvanese@unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Ciaccia</string-name>
          <email>paolo.ciaccia@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Greco</string-name>
          <email>greco@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Lanti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilia Lenzi</string-name>
          <email>emilia.lenzi@polimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Martinenghi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristian Molinaro</string-name>
          <email>c.molinaro@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Patella</string-name>
          <email>marco.patella@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Letizia Tanca</string-name>
          <email>letizia.tanca@polimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
          <email>riccardo.torlone@uniroma3.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Trubitsyna</string-name>
          <email>i.trubitsyna@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum University of Bologna</institution>
          ,
          <addr-line>DISI, 40100 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano, Faculty of Engineering</institution>
          ,
          <addr-line>39100 Bolzano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Politecnico di Milano, DEIB</institution>
          ,
          <addr-line>20133 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Roma Tre University</institution>
          ,
          <addr-line>DICITA, 00146 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Calabria, DIMES</institution>
          ,
          <addr-line>87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>This paper presents the vision of the S-PIC4CHU project, which aims to develop innovative models and techniques for scalable data preparation in Data Science and Machine Learning. The project focuses on leveraging data semantics throughout all data preparation stages to improve data quality and ensure unbiased results. The proposed approach involves a novel data preparation pipeline semantically enriched with domain knowledge from ontologies and knowledge graphs, along with novel, semantics-based techniques for data cleaning, integration, provenance, explanation, and quality management. The validation of the approach relies on use cases from different domains, with the goal of releasing open-source tools.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Science</kwd>
        <kwd>data preparation</kwd>
        <kwd>data quality</kwd>
        <kwd>semantics</kwd>
        <kwd>ontologies</kwd>
        <kwd>inconsistency</kwd>
        <kwd>incompleteness</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>provenance</kwd>
        <kwd>explanation</kwd>
        <kwd>bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing reliance on Data Science (DS) and Machine Learning (ML) techniques across
various sectors highlights the critical importance of data quality [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Real-world data is often
characterized by inaccuracies, noise, uncertainties, and inconsistencies, which can significantly
affect the results of DS and ML tasks. There are indeed various libraries and tools available to
correct erroneous values, impute missing ones, eliminate duplicate records, or disambiguate
conflicting data. However, the activities of data preparation are difficult to fully automate and
there is no solution today for composing, analyzing, and explaining end-to-end pipelines that
transform data from raw input into training sets ready to be used for learning. In addition,
these data processing activities merely refer to the syntax of the data and are rarely explained
or characterized in terms of their meaning. The S-PIC4CHU project addresses these challenges
by introducing a semantics-based approach to data preparation.
      </p>
      <p>
        The core idea is to develop a Data Preparation Pipeline (DPP) where data is annotated with
semantic information derived from ontologies and knowledge graphs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This semantic
enrichment plays a crucial role in subsequent data preparation steps, including data cleaning,
integration, transformation, reduction, deduplication, error detection, missing value imputation,
and space transformations. Furthermore, semantic techniques assist in reconciling conflicts
among different data quality dimensions, a well-known challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in the field of data quality.
      </p>
      <p>
        The project’s main objectives are to develop novel models and techniques for data preparation,
focusing on the semantic enrichment of data and ensuring data quality and fairness, also
providing provenance information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to enable explainable AI [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The challenges raised by
this paradigm shift in data preparation require formalizing and addressing new aspects that
are relevant to the data preparation process, and investigating key challenging scientific issues
aimed at delivering novel solutions to well-known problems. The project contributes to the
development of open-source software tools and promotes awareness of data fairness, engaging
with various stakeholders.
      </p>
      <p>The project is validated on two selected use cases from different domains to showcase the
generality and effectiveness of the proposed solutions. S-PIC4CHU seeks to provide solutions
that are not only scientifically sound but also have significant societal and economic impact
due to the increasing importance of DS and ML in various sectors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The problem of developing a quality-aware pipeline of data preparation operations has been
the subject of recent research, see, e.g., [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Here we discuss in more detail the state of the art
in the areas that are crucial for the stages of data preparation in DS, noting that the literature
has so far paid little attention to the risks of ignoring potential conflicts among different data
quality dimensions.
      </p>
      <p>
        Data Semantics. A clear understanding of the data semantics plays a key role in all stages of
the data processing pipeline, therefore data semantics needs to be taken into account explicitly.
However, technologies and tools that provide semantics-based solutions, notably those relying
on ontologies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and automated reasoning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are currently considered either in a limited way
or not at all in the steps of the pipeline. A critical aspect in the adoption of such technologies
is their scalability, w.r.t. both the size of the data and the size and complexity of the ontology.
Such aspects have been studied in the Semantic Web community, but mostly restricted to the
specific case of RDF (linked) data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], thus scalability of semantic technologies in the context
of a data processing pipeline is an open research challenge.
      </p>
      <p>Data Imputation. An approach to deal with the problem of missing values is data imputation,
i.e., replacing unspecified values with concrete ones [12, 13], for which statistical and machine
learning (e.g., adversarial networks) algorithms have been proposed. However, all algorithms
proposed so far deal only with raw datasets, i.e., datasets not equipped with additional knowledge.</p>
      <sec id="sec-2-1">
        <title>Preference-based Inconsistency Management</title>
        <p>While the management of inconsistent data has been studied extensively, incorporating
preferences has received less attention. Even if a few preference-based approaches have been
proposed, e.g., [14, 15], they neither allow users to express preferences on knowledge derived
from the data, nor preconditions for preferences to hold. The first recent proposal to overcome
these limitations is [16].</p>
        <p>Data Dependencies. Functional dependencies (FDs) and their variants [17] have already
been used to enforce data quality. For instance, (approximate) conditional FDs, i.e., FDs that
hold only over a portion of the data, can be searched via data mining techniques, like association
rules [18]: these rules, and the possible non-conformant records, are returned to the user to
decide which of them must be fixed to improve data quality.</p>
        <p>
          Bias. When data is used to build models impacting people’s lives, we have to be sure that
the data and the models trained on them do not introduce bias [19]. Thus, in such applications,
data can be considered of good quality only if it respects fairness requirements [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Methods
based on FDs and their variants can also be adopted to discover discrimination and bias in a
dataset [20], so as to avoid (possibly unintentional) unfair behavior and consequences.
        </p>
        <p>
          Data Reduction. Before being fed into DS algorithms, data is usually “reduced” to improve
the quality of results, using activities such as feature selection [21], object selection [22], data
aggregation, and clustering. The novel notion of ℱ-dominance [23, 24] allows for expressing
object and feature selection through a family of ranking functions with constraints on their
parameters (e.g., weights). Considering a family rather than a single function improves the
robustness and flexibility of the selection process.
        </p>
        <p>
          Multimedia Data Curation. Although DS technologies allow for extracting value from
large conventional data repositories, their application to multimedia (MM) data is still an open
research issue [25]. This is mainly due to the very complex nature of MM data, whose content and
semantics still lack an appropriate methodology for accurate and efficient characterization [26].
The problem is further aggravated in the case of real-time analysis of MM streams [27].
        </p>
        <p>
          Provenance and Explanation. Explaining positive/negative query answers, i.e., knowing
why a query result was/was not obtained, falls into the broad topic of Explainable AI. For
relational DBs, provenance [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is used in systems like Perm and ProvSQL [28] to keep track
of the specific DB tuples responsible for deriving an answer. For DS, provenance has been
applied, e.g., in data preparation [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For ontologies, explaining query answers has been studied
for different description logics (DLs) [29, 30, 31], under existential rules [32, 33], considering
inconsistency [34], and for Ontology-based Data Access [35], but several problems still need to
be investigated.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>We refer to the typical Data Preparation Pipeline (DPP), organized in a sequence of steps aimed
at transforming raw data into a clean, structured, and meaningful format that improves the
accuracy, efficiency, and robustness of machine learning models. A distinguishing feature of
our approach is in fact the semantic enrichment of data, whose aim is to annotate all data
that is involved in the various stages of the DPP with semantic information capturing domain
knowledge and coming from ontologies and knowledge graphs. The methodology for semantic
enrichment that we adopt in the project builds on Ontology-based Data Access (OBDA) and
extends it to cover multiple versions of the data corresponding to the pipeline stages. Specifically,
we capture at the semantic level, into a so-called Semantic Transformation Pipeline (STP), the
data and its transformations that constitute the DPP. We illustrate the resulting conceptual
architecture in Figure 1.</p>
      <p>As mentioned, the various stages of the DPP are linked to each other through data
preparation steps DP. Those stages that are (semantically) meaningful for the application
domain have a corresponding semantic stage counterpart in the STP, provided as an ontology
to which they are linked through a semantic mapping ℳ. This correspondence induces, in
turn, a correspondence between sequences of data preparation steps (e.g., DP1–DP2 in the
figure) and single semantic transformation steps (e.g., ST1−2).</p>
      <p>Our semantics-centered architecture serves as the basis for realizing the specific contributions
of the S-PIC4CHU approach, in which we concentrate on the key issues that follow.
Semantic Enrichment. To achieve semantic enrichment, we advance the state of the art in
several directions:</p>
      <p>1) We establish mechanisms to construct ontologies for STP stages, representing the data
handled in DPP stages. Our (semi-)automatic approach leverages both data and metadata
while identifying ontology elements that remain unchanged and those evolving with data
transformations.</p>
      <p>2) We develop novel methods to abstract data preparation operations of the DPP into ontology
transformation operations of the STP, balancing retention of detail vs. abstraction, based on the
granularity of semantic representations from Item 1.</p>
      <p>3) We (semi-)automatically derive semantic mappings between DPP stages and their STP
counterparts. For these mappings, in which queries provide the detailed correspondences
between the data layer and the semantic layer, we rely on OBDA mapping patterns [36], which
exploit relational constraints reflected in ontology constructs. We also consider operations at
both data and semantic levels that generate the various stages of the pipelines (cf. Item 2).</p>
      <p>4) We explore the automatic derivation of additional semantic mappings between STP stages
and intermediate DPP stages, including the original data sources, building on [37].</p>
      <p>The resulting semantic enrichments of DPPs aid in designing semantically-aware data
preparation tasks and in deriving provenance information and explanations, linking back to original
or intermediate data.</p>
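      <p>As an illustration of the row-to-triples direction underlying Item 3, the following Python sketch translates a tabular DPP stage into ontology-level triples. The schema, IRI templates, and property names are purely hypothetical; an actual OBDA mapping would be declarative (pairs of queries and triple templates), following the mapping patterns of [36], rather than hand-coded.</p>

```python
# Minimal sketch of an OBDA-style mapping: each row of a (hypothetical) pipeline
# stage is turned into (subject, predicate, object) triples, so that a DPP stage
# obtains its STP counterpart at the semantic level.

def map_rows(rows, subject_tmpl, class_iri, attr_map):
    """Translate relational rows into (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subj = subject_tmpl.format(**row)
        triples.append((subj, "rdf:type", class_iri))
        for column, prop in attr_map.items():
            if row.get(column) is not None:      # skip missing values
                triples.append((subj, prop, row[column]))
    return triples

# A toy "patients" stage of the data layer (invented for the example).
stage_D1 = [
    {"pid": 7, "name": "Ada", "ward": "cardiology"},
    {"pid": 9, "name": "Bob", "ward": None},
]

triples = map_rows(
    stage_D1,
    subject_tmpl=":patient/{pid}",   # hypothetical IRI template
    class_iri=":Patient",
    attr_map={"name": ":hasName", "ward": ":admittedTo"},
)
for t in triples:
    print(t)
```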
      <p>Data Quality. Real-world data from multiple sources are often inconsistent or incomplete,
lowering the quality of DS tasks. Domain knowledge (e.g., ontologies, preferences) can enhance
data quality by addressing inconsistency, incompleteness, and fairness while ensuring efficient
data reduction. We approach these issues by leveraging domain knowledge expressed through
user preferences, ontologies, functional dependencies, and Data Imputation Rules (DIRs), that is,
rules that guide how missing values are filled in. Our goals include:</p>
      <p>1) Ontology-enriched usage of DIRs for missing data imputation. Many data analysis algorithms
require complete datasets, often discarding incomplete records. Data imputation replaces
missing values, aiding analytics and aggregate queries. Existing methods rely solely on raw data.
We propose missing data imputation techniques within OBDA, leveraging DIRs to incorporate
semantic knowledge that guides the replacement of missing values in a meaningful way.</p>
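      <p>A minimal sketch of how a DIR might operate is shown below, assuming a toy ontology-derived lookup; the rule format, data, and lookup table are illustrative only and do not correspond to the project’s actual formalism.</p>

```python
# Sketch of Data Imputation Rules (DIRs): a rule fires when its precondition
# holds and fills a missing attribute from a semantic source (here, a toy
# ontology-derived lookup). All names and rules are invented for the example.

CITY_OF_ZIP = {"40100": "Bologna", "20133": "Milano"}   # hypothetical knowledge

def impute(records, rules):
    """Apply each (target, condition, fill) rule to records with missing values."""
    for rec in records:
        for target, condition, fill in rules:
            if rec.get(target) is None and condition(rec):
                rec[target] = fill(rec)
    return records

rules = [
    # DIR: if the ZIP code is known, derive the missing city from the ontology.
    ("city", lambda r: r.get("zip") in CITY_OF_ZIP,
             lambda r: CITY_OF_ZIP[r["zip"]]),
]

data = [{"zip": "40100", "city": None}, {"zip": "99999", "city": None}]
print(impute(data, rules))
# the first record gets its city filled in; the second stays None (no rule applies)
```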
      <p>2) Preference-based resolution of inconsistencies. Preferences help extract meaningful
information from inconsistent data, e.g., when one source is more reliable than another one or when
more recent facts are preferred over earlier ones. When coupled with ontologies, one should be
able to express preferences also on information derivable from the data via the ontology. Also,
it is important to consider that users have different preferences under different circumstances,
and in ontological settings knowledge not known in advance can affect which contextual
preferences should be applied. We develop a framework to manage inconsistencies with
contextual preferences, which requires formalizing preferences while balancing expressiveness and
computational complexity, and establishing their impact on query answering.</p>
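      <p>The intuition behind contextual, preference-based conflict resolution can be sketched as follows; the sources, priorities, and facts are invented, and the two-level priority stands in for a full preference formalism.</p>

```python
# Toy sketch of contextual preference-based conflict resolution: among facts
# that contradict each other (same key, different values), keep the one
# preferred by a context-dependent priority over sources, breaking ties by
# recency. Sources and priorities are assumptions for the example.

def resolve(facts, priority):
    """facts: iterable of (key, value, source, timestamp)."""
    best = {}
    for key, value, source, ts in facts:
        cand = (priority.get(source, 0), ts, value)
        if key not in best or cand[:2] > best[key][:2]:
            best[key] = cand
    return {k: v[2] for k, v in best.items()}

facts = [
    ("alice.ward", "cardiology", "hospital_A", 2023),
    ("alice.ward", "neurology",  "hospital_B", 2024),
]
# Context 1: hospital_A is considered more reliable, so its fact wins.
print(resolve(facts, {"hospital_A": 2, "hospital_B": 1}))
# Context 2: equal trust, so the more recent fact wins instead.
print(resolve(facts, {}))
```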
      <p>3) Detecting and correcting bias in data. Fairness is essential in training data for ML models,
to ensure that one can trust the outcome of the process. Most existing work assesses the fairness
of the employed analysis algorithms, but coherently with the S-PIC4CHU aim, we focus on
ensuring fair input data. Approximate Conditional Functional Dependencies (ACFDs) [17]
and data mining techniques [18] help uncover biases, identifying attribute correlations that
influence decisions (e.g., job recommendations based on gender or ethnicity). We propose a
framework for bias discovery based on mining ACFDs and possibly other kinds of dependencies
that take into account the domain semantics. This allows the system to discover bias that would
not be found otherwise. We also propose new evaluation metrics for bias in input data. For
multimedia (MM) data, biases arise from spatial (e.g., images) and temporal (e.g., audio/video)
correlations, which violate the assumption of independent and identically distributed
data and lead to overfitting the training data [38]. We investigate semantic enrichment of MM data
via features characterizing their content, to exploit alternative DSAs (like 3D CNNs) able to
deal with existing data correlations.</p>
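      <p>The ACFD-based idea can be illustrated with a toy confidence computation: for a protected attribute X, a high confidence of X → decision on some subgroup flags a candidate dependency to report. The data, attributes, and threshold below are invented for the example.</p>

```python
# Sketch of bias discovery via approximate (conditional) dependencies:
# measure, per value of the left-hand-side attribute, how strongly it
# determines the decision attribute. Illustrative data and threshold.

from collections import Counter, defaultdict

def acfd_confidence(rows, lhs, rhs):
    """For each lhs value, confidence of the best-supported rhs value."""
    by_lhs = defaultdict(Counter)
    for row in rows:
        by_lhs[row[lhs]][row[rhs]] += 1
    return {v: max(c.values()) / sum(c.values()) for v, c in by_lhs.items()}

rows = [
    {"gender": "F", "hired": "no"}, {"gender": "F", "hired": "no"},
    {"gender": "F", "hired": "no"}, {"gender": "F", "hired": "yes"},
    {"gender": "M", "hired": "yes"}, {"gender": "M", "hired": "no"},
]

conf = acfd_confidence(rows, "gender", "hired")
flagged = {v for v, c in conf.items() if c >= 0.75}
print(conf)     # per-value confidence of gender -> hired
print(flagged)  # gender values where the decision is (nearly) determined
```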
      <p>Data Reduction. Feature selection [21] removes irrelevant features that introduce noise,
bias, and computational overhead, a critical issue especially for MM data, as illustrated by the
US Army’s neural network experiment on camouflaged tanks [39]. Given the variety of object
descriptions, optimal feature sets vary by media type and DS task. Indeed, raw MM data is
unsuitable for DSAs without a careful selection to prevent biased learning. We
propose a software framework for (semi-automated) feature selection tailored to specific use
cases. Moreover, we consider the relevant example of real-time massive MM stream analysis,
where very low latency for real-time results is essential.</p>
      <p>Object selection [22] ensures high-quality, error-free input to DSAs, preventing wasted
resources and biased models. We explore ℱ-dominance-based methods to analyze ranking
functions’ impact on diversity and fairness [40]. A key challenge is leveraging semantic
information for cross-source object matching, to enable distributed ℱ-dominance algorithms [41].
Our techniques also support advanced analysis of classifier outputs, revealing correlations
between data variables and predictions. Additionally, we examine the relationship between
top-k queries (based on ranking functions) and skyline queries (using Pareto-dominance), both
aimed at selecting relevant objects. Understanding the “discovery power” of top-k queries,
i.e., their ability to retrieve skyline objects, helps assess dataset suitability for analysis tasks,
providing decision-makers with valuable insights.</p>
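      <p>The interplay between skyline and top-k selection can be made concrete with a small sketch; the points, weights, and “weight family” below are invented, and the family of linear scoring functions only hints at the ℱ-dominance setting of [23, 24].</p>

```python
# Sketch relating top-k (ranking) and skyline (Pareto-dominance) selection:
# which skyline points can be "discovered" by some scoring function in a
# family of weightings? Larger attribute values are better. Toy data only.

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def topk(points, weights, k):
    return sorted(points, key=lambda p: sum(w * x for w, x in zip(weights, p)),
                  reverse=True)[:k]

pts = [(9, 1), (7, 7), (1, 9), (5, 5)]
sky = skyline(pts)                       # (5, 5) is dominated by (7, 7)
# "Discovery power": skyline points retrieved by some weighting in the family.
family = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
found = {p for w in family for p in topk(pts, w, 1)}
print(sky)
print(found & set(sky))
```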
      <p>Data Provenance and Explanation. Data provenance and explanation serve two purposes:
explaining positive query answers by enriching results with provenance details (e.g., creation as
a new derived feature, transformation, or deletion) and explaining negative query answers via
abduction, identifying information to be added to derive some missing answer. In the presence
of inconsistency, such tasks rely on a repair-based semantics.</p>
      <p>
        For positive query answers in OBDA, existing methods extend the semiring approach for
relational DBs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], taking into account both the ontology and the mappings [35]. This methodology,
however, imposes strong assumptions on the provenance semirings, which restrict the forms of
derivable explanations. We aim to enhance the expressiveness of explanations by incorporating
ontological deduction steps, considering diverse data sources (e.g., graph-structured data,
temporally annotated data), and also the stages in the DPP and the corresponding stages in the STP.
For the latter, we exploit the techniques developed in Item 4 of semantic enrichment, to provide
explanations also in terms of the data at any stage of the DPP, including the original data.
      </p>
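      <p>For the relational baseline, the semiring idea can be sketched as follows; the relations and provenance tokens are invented, and the operations <code>*</code> (joint use) and <code>+</code> (alternative derivations) are kept symbolic to mimic provenance polynomials in the style of [5].</p>

```python
# Minimal sketch of semiring provenance for positive answers: annotate source
# tuples with identifiers and combine them symbolically, yielding a provenance
# expression for each query answer. Toy schema and data.

# Source relations, each tuple annotated with a provenance token.
works_in = [(("ada", "db"), "t1"), (("bob", "db"), "t2")]
leads    = [(("db", "carol"), "t3")]

# Query: who has Carol as group leader? (join on the group attribute)
provenance = {}
for (emp, g1), tok1 in works_in:
    for (g2, boss), tok2 in leads:
        if g1 == g2 and boss == "carol":
            # '*' records that both tuples are needed; '+' merges
            # alternative derivations of the same answer.
            expr = f"{tok1}*{tok2}"
            provenance[emp] = (provenance[emp] + " + " + expr
                               if emp in provenance else expr)

print(provenance)   # each answer maps to its provenance expression
```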
      <p>For negative query answers, existing explanations are limited. We plan to explore new notions
based on alternative minimality criteria and preference structures, also addressing diferent
inconsistency-tolerant semantics. In addition, we properly take into account semantic mappings
to STP stages, building on Item 4 of semantic enrichment and on [37].</p>
    </sec>
    <sec id="sec-4">
      <title>4. Case Studies and Experimentation</title>
      <p>The S-PIC4CHU approach will be validated on two use cases drawn from different domains to
ensure generality and effectiveness.</p>
      <sec id="sec-4-1">
        <title>4.1. Health Data Curation</title>
        <p>Nowadays, healthcare organizations are increasingly adopting data-centric architectures
to be more effective in clinical research: this is achievable by going beyond the conventional
hospital boundaries, to tap into datasets made available by other hospitals. With a focus on the
Italian reality, we aim to leverage semantic-based data management and the S-PIC4CHU Data
Preparation Pipeline to support multicentric clinical trials: the Policlinico Universitario Agostino
Gemelli, the second-largest hospital in Italy, provides various health data sources, including
outpatient visits, hospitalizations, and drug data, and also contributes feedback from its
domain experts. Specifically, attention should be paid to the pipeline for preparing the data used
by data lake platforms for data sharing among different hospitals according to the FAIR principles
(Findable, Accessible, Interoperable, Reusable) [42]. This requires (i) data acquisition and
curation to mediate the diferent data representations used by the various hospitals, (ii) accurate
metadata management for enabling subsequent data discovery and comparison, (iii) an access
control model satisfying the constraints imposed by personal data protection regulations,
(iv) novel analysis tools, such as AI-based approaches, pursuing increasingly flexible and scalable
solutions. The first two steps are the ones taken into account within S-PIC4CHU. Indeed, health
data comes in various forms, requiring specific processing to extract valuable information.
Macro categories include biosignals, bioimages, “omic” data (e.g., genomics and proteomics),
and textual documents. Differences in structure and content pose several challenges to the
design of a platform performing data integration from disparate hospitals or research centers,
and to reach this goal data interoperability is a key target. Syntactic interoperability is already a
challenge, due to the variety of formats and structures of the data, but semantic interoperability is
much more important to ensure data trustworthiness and reliable interpretation of the analysis.
The various already existing medical ontologies and vocabularies, such as SNOMED CT
(https://www.snomed.org/) or LOINC (https://loinc.org/), help unambiguously identify
variable meanings, along with harmonization pipelines
to minimize variability in measurements collected across different centers. The S-PIC4CHU DPP
is a precious support in this work, since the envisaged semantic techniques and the improved
data preparation tools provided by the project assist in reconciling conflicts among different
medical data standards.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Architecture and Sustainable Development</title>
        <p>This use case, developed with the IMM Design Lab at PoliMI
(http://www.immdesignlab.com/), aims to support urban policymakers using the Integrated
Modification Methodology (IMM) [43], aligning with the European Sustainable Development
Goals (SDGs, https://unric.org/it/agenda-2030/). The goal is to create open-source tools compatible
with the S-PIC4CHU reference architecture, serving as a testbed for the techniques discussed
in Section 3. The experimental phase evaluates the applicability, robustness, efficiency, and
effectiveness of the proposed solutions.</p>
        <p>Data collection for urban environments is challenging due to the lack of unified standards,
semantic ambiguities, varied data collection scales, and the absence of agreed-upon ontologies
and benchmarks. The proposed pipeline addresses these issues, focusing on dataset creation for
urban analysis. Data Acquisition: Data on urban parameters, including geospatial layouts
and environmental conditions are collected from municipalities, open-source platforms, and
IMM files. Using platform APIs and custom parsers, data collection is standardized, formats are
unified, and georeferencing ensures spatial alignment and compatibility through PostgreSQL,
PostGIS, and GIS tools. Data Curation: Raw datasets are refined through feature engineering,
introducing derived variables and spatial indices using PostGIS functions. Ontology-based
transformations address semantic inconsistencies, aligning terms and metrics for coherent analysis.
Data Integration: This phase harmonizes data by resolving granularity and semantic
discrepancies, ensuring spatial and temporal coherence. Techniques include temporal alignment of
datasets with different resolutions, spatial integration of demographic and environmental layers,
and ontology-based management of multi-scale datasets from district to city levels. Database
Implementation: Data is stored in a scalable, relational database optimized for geospatial
analysis. The database supports advanced queries, such as identifying low-accessibility areas or
mapping urban heat islands, with ontology-based enhancements for improved adaptability and
usability. Utilization and Analysis: The pipeline uses the database for predictive modeling,
feature analysis, and visualization. Models analyze relationships between variables, such as
traffic congestion’s impact on air quality. Feature-importance analysis highlights key drivers of
urban trends, offering actionable insights.</p>
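        <p>As a toy illustration of the temporal-alignment step of the integration phase, fine-grained readings can be aggregated to the coarser resolution of another dataset before joining; the timestamps and values below are invented, and only the Python standard library is used.</p>

```python
# Sketch of temporal alignment for datasets with different resolutions:
# average minute-level sensor readings into hourly buckets, then join them
# with an hourly dataset. All data is illustrative.

from collections import defaultdict

def align_to_hours(readings):
    """Average minute-level (timestamp_in_minutes, value) readings per hour."""
    buckets = defaultdict(list)
    for minute, value in readings:
        buckets[minute // 60].append(value)      # hour index
    return {hour: sum(v) / len(v) for hour, v in buckets.items()}

# Minute-resolution air-quality readings vs. hourly traffic counts (toy data).
air = [(0, 10.0), (30, 14.0), (60, 20.0), (90, 22.0)]
traffic_by_hour = {0: 120, 1: 300}

hourly_air = align_to_hours(air)
joined = {h: (traffic_by_hour[h], hourly_air[h]) for h in traffic_by_hour}
print(joined)   # hour -> (traffic count, average air-quality reading)
```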
        <p>In conclusion, the pipeline standardizes data formats, harmonizes semantics, and integrates
datasets across scales, ensuring cohesive and contextually relevant outputs. It supports IMM
indicator computation and integrates findings with other research domains, leveraging
ontology-based methods and geospatial tools for semantic alignment and precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The S-PIC4CHU project aims to revolutionize the field of data preparation by incorporating
semantic techniques throughout the entire process. The project addresses the critical issues of
data quality, bias, and explainability, and intends to deliver innovative models and techniques to
overcome the limitations of existing tools. By combining research from different domains, the
S-PIC4CHU project is set to make a significant contribution to both the scientific community and
society as a whole, with the potential to have an economic impact. The project is committed to
dissemination, with a range of activities undertaken to ensure the widest possible dissemination of
our results, reaching a broad cross-section of computer science researchers and IT practitioners.
By promoting awareness of data fairness, the project intends to ensure its impact on society
with fairness-aware tools and methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Italian Ministry of University and Research (MUR) PRIN 2022
grant 2022XERWK9 “S-PIC4CHU - Semantics-based Provenance, Integrity, and Curation for
Consistent, High-quality, and Unbiased data science”.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Ethical dimensions for data quality</article-title>
          ,
          <source>ACM J. Data Inf. Qual</source>
          .
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>2:1</fpage>-<lpage>2:5</lpage>. doi:10.1145/3362121.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          , J. Han,
          <article-title>Democratizing data science through interactive curation of ML pipelines</article-title>
          , in
          <source>: Proc. of the 40th ACM Int. Conf. on Management of Data (SIGMOD)</source>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1171</fpage>
          -
          <lpage>1188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. de Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmelzeisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2022</year>
          )
          <fpage>71:1</fpage>-<lpage>71:37</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <source>Data Quality: Concepts</source>
          ,
          <source>Methodologies and Techniques, Data-Centric Systems and Applications</source>
          , Springer,
          <year>2006</year>
          . URL: https://doi.org/10.1007/3-540-33173-5. doi:10.1007/3-540-33173-5.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>The semiring framework for database provenance</article-title>
          ,
          <source>in: Proc. of the ACM Symp. on Principles of Database Systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>99</lpage>
          . doi:10.1145/3034786.3056125.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lukyanenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Maass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parsons</surname>
          </string-name>
          ,
          <article-title>Explainable AI</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>65</volume>
          (
          <year>2022</year>
          )
          <fpage>27</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          , G. Simonelli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Capturing and querying fine-grained provenance of preprocessing pipelines in data science</article-title>
          ,
          <source>Proc. of the VLDB Endowment</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>507</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zgraggen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Buratti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Eichmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          , E. Upfal, T. Kraska,
          <article-title>Democratizing data science through interactive curation of ML pipelines</article-title>
          , in
          <source>: Proc. of the ACM Int. Conf. on Management of Data (SIGMOD)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1171</fpage>
          -
          <lpage>1188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          , Foundations of Semantic Web Technologies, Chapman &amp; Hall/CRC,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Baader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          (Eds.),
          <source>The Description Logic Handbook: Theory, Implementation and Applications</source>
          , 2nd ed., Cambridge University Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , T. Käfer,
          <article-title>BTC-2019: The 2019 Billion Triple Challenge dataset</article-title>
          ,
          <source>in: Proc. of the Int. Semantic Web Conf. (ISWC)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>180</lpage>
          . doi:10.1007/978-3-030-30796-7_11.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] W. Lin, C. Tsai, Missing value imputation: A review and analysis of the literature (2006–2017), Artificial Intelligence Review 53 (2020) 1487–1509.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Song, A. Zhang, L. Chen, L. Wang, Enriching data imputation with extensive similarity neighbors, Proc. of the VLDB Endowment 8 (2015) 1286–1297.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] M. Bienvenu, C. Bourgaux, F. Goasdoué, Querying inconsistent description logic knowledge bases under preferred repair semantics, in: Proc. of the AAAI Conf. on Artificial Intelligence (AAAI), 2014, pp. 996–1002.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Staworko, J. Chomicki, J. Marcinkowski, Prioritized repairing and consistent query answering in relational databases, Ann. of Mathematics and Artificial Intelligence 64 (2012) 209–246.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Calautti, S. Greco, C. Molinaro, I. Trubitsyna, Preference-based inconsistency-tolerant query answering under existential rules, in: Proc. of the Int. Conf. on Principles of Knowledge Representation and Reasoning, 2020, pp. 203–212.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Caruccio, V. Deufemia, G. Polese, Relaxed functional dependencies - a survey of approaches, IEEE Trans. on Knowledge and Data Engineering 28 (2016) 147–165.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Mazuran, E. Quintarelli, L. Tanca, S. Ugolini, Semi-automatic support for evolving functional dependencies, in: Proc. of the 19th Int. Conf. on Extending Database Technology (EDBT), 2016, pp. 293–304.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] C. O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Penguin Books, 2016.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Stoyanovich, B. Howe, H. Jagadish, Responsible data management, Proc. of the VLDB Endowment 13 (2020) 3474–3488.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] M. Bedo, P. Ciaccia, D. Martinenghi, D. de Oliveira, A k-Skyband approach for feature selection, in: Proc. of the 12th Int. Conf. on Similarity Search and Applications (SISAP), volume 11807 of Lecture Notes in Computer Science, Springer, 2019, pp. 160–168.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] P. Ciaccia, D. Martinenghi, Directional queries: Making top-k queries more effective in discovering relevant results, Proc. of ACM Management of Data 2 (2024) 232:1–232:26. doi:10.1145/3698807.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Ciaccia, D. Martinenghi, Reconciling skyline and ranking queries, Proc. of the VLDB Endowment 10 (2017) 1454–1465.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] P. Ciaccia, D. Martinenghi, Flexible skylines: Dominance for arbitrary sets of monotone functions, ACM Trans. on Database Systems 45 (2020) 1–45.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Han, M. Kamber, J. Pei, Data mining: Concepts and techniques, in: Data Mining Trends and Research Frontiers, Elsevier, 2016.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] I. Bartolini, M. Patella, Windsurf: The best way to SURF (and SIFT/BRISK/ORB/FREAK, too), Multimedia Systems 24 (2018) 459–476.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] I. Bartolini, M. Patella, A general framework for real-time analysis of massive multimedia streams, Multimedia Systems 24 (2018) 391–406.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] P. Senellart, L. Jachiet, S. Maniu, Y. Ramusat, ProvSQL: Provenance and probability management in PostgreSQL, Proc. of the VLDB Endowment 11 (2018) 2034–2037.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] A. Borgida, D. Calvanese, M. Rodriguez-Muro, Explanation in the DL-Lite family of description logics, in: Proc. of On the Move to Meaningful Internet Systems: Confederated Int. Conf. (OTM), 2008, pp. 1440–1457.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] C. Bourgaux, A. Ozaki, Querying attributed DL-Lite ontologies using provenance semirings, in: Proc. of the AAAI Conf. on Artificial Intelligence (AAAI), 2019, pp. 2719–2726.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] D. Calvanese, M. Ortiz, M. Simkus, G. Stefanoni, Reasoning about explanations for negative query answers in DL-Lite, J. of Artificial Intelligence Research 48 (2013) 635–669.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] I. Ceylan, T. Lukasiewicz, E. Malizia, A. Vaicenavicius, Explanations for query answers under existential rules, in: Proc. of the Int. Joint Conf. on Artificial Intelligence, 2019, pp. 1639–1646.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] I. Ceylan, T. Lukasiewicz, E. Malizia, C. Molinaro, A. Vaicenavicius, Explanations for negative query answers under existential rules, in: Proc. of the Int. Conf. on Principles of Knowledge Representation and Reasoning, 2020, pp. 223–232.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] T. Lukasiewicz, E. Malizia, C. Molinaro, Explanations for inconsistency-tolerant query answering under existential rules, in: Proc. of the AAAI Conf. on Artificial Intelligence (AAAI), 2020, pp. 2909–2916.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] D. Calvanese, D. Lanti, A. Ozaki, R. Penaloza, G. Xiao, Enriching ontology-based data access with provenance, in: Proc. of the Int. Joint Conf. on Artificial Intelligence, 2019, pp. 1616–1623.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] D. Calvanese, A. Gal, D. Lanti, M. Montali, A. Mosca, R. Shraga, Conceptually-grounded mapping patterns for Virtual Knowledge Graphs, Data and Knowledge Engineering 145 (2023) 102157. doi:10.1016/j.datak.2023.102157.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] D. Calvanese, T. Kalayci, M. Montali, A. Santoso, W. van der Aalst, Conceptual schema transformation in ontology-based data access, in: Proc. of the Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW), 2018, pp. 50–67.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence 35 (2012) 221–231.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] H. Dreyfus, S. Dreyfus, What artificial experts can and cannot do, AI &amp; Society 6 (1992) 18–26.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] A. Gale, A. Marian, Explaining ranking functions, Proc. of the VLDB Endowment 14 (2021) 640–652.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] P. Ciaccia, D. Martinenghi, FA+TA&lt;FSA: Flexible score aggregation, in: Proc. of the ACM Int. Conf. on Information and Knowledge Management (CIKM), 2018, pp. 57–66.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42] R. D. Kush, et al., FAIR data sharing: The roles of common data elements and harmonization, J. of Biomedical Informatics 107 (2020) 103421. doi:10.1016/j.jbi.2020.103421.</mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43] T. Massimo, Integrated Modification Methodology (IMM): A phasing process for sustainable urban design, World Academy of Science, Engineering and Technology 77 (2013).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>