<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>S-PIC4CHU: Semantics-Enriched Techniques for Data Preparation in Data Science</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianvincenzo Alfano</string-name>
          <email>g.alfano@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilaria Bartolini</string-name>
          <email>ilaria.bartolini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Calvanese</string-name>
          <email>diego.calvanese@unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Ciaccia</string-name>
          <email>paolo.ciaccia@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Greco</string-name>
          <email>greco@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Lanti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Leonardo Lazzaro</string-name>
          <email>pasqualeleonardo.lazzaro@uniroma3.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilia Lenzi</string-name>
          <email>emilia.lenzi@polimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Martinenghi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristian Molinaro</string-name>
          <email>c.molinaro@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Patella</string-name>
          <email>marco.patella@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Letizia Tanca</string-name>
          <email>letizia.tanca@polimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
          <email>riccardo.torlone@uniroma3.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Trubitsyna</string-name>
          <email>i.trubitsyna@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum University of Bologna</institution>
          ,
          <addr-line>DISI, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano, Faculty of Engineering</institution>
          ,
          <addr-line>Bolzano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Politecnico di Milano, DEIB</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Roma Tre University</institution>
          ,
          <addr-line>DICITA, Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Calabria, DIMES</institution>
          ,
          <addr-line>Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The S-PIC4CHU project deals with the crucial issue of data preparation for Data Science and Machine Learning, and aims to offer new models and techniques for fighting inaccuracy, noise, uncertainty, bias, and incompleteness of data. While the project embraces, at its core, a semantics-based approach, the proposed data preparation pipeline includes data cleaning (also from an ethical viewpoint), transformation, and reduction, as well as deduplication, error detection, missing value imputation, and space transformations for multimedia data. This paper illustrates the advancements on all these fronts, achieved during the first months of work on the project, and sets out the forthcoming actionable objectives.</p>
      </abstract>
      <kwd-group>
        <kwd>graphs</kwd>
        <kwd>provenance</kwd>
        <kwd>explanation</kwd>
        <kwd>bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The effectiveness of data-driven applications critically depends on the quality of the data they
consume. Yet, in most real-world scenarios, data is rarely “clean”: it often suffers from inaccuracy, noise,
incompleteness, duplication, ethical problems, and inconsistencies, which can significantly hinder
downstream analytical or learning tasks. Despite the growing sophistication of machine learning
algorithms and data analytics platforms, the preparatory steps that bring raw data to a usable state
remain complex, labor-intensive, and error-prone. As a result, data preparation has emerged as one of
the most resource-demanding and mission-critical stages in modern data science workflows.</p>
      <p>Traditional data preparation techniques are typically tailored to specific tasks, such as missing
value imputation, outlier detection, or deduplication, and often rely on heuristic or statistical models
that lack generalizability and transparency. Moreover, the increasing diversity and heterogeneity of
data sources — including structured, semi-structured, unstructured, and multimedia formats — further
complicate the design of unified, scalable preparation pipelines. In this context, semantics-based
approaches offer a promising direction by enabling a higher-level understanding of the meaning and
structure of data, allowing for more robust and explainable interventions.</p>
      <p>CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073</p>
      <p>[Figure 1. Labels recovered from the original diagram: D0, ..., Dn; DP0, ..., DPn−1; M0, M1, M3, M4, ..., Mn; ST0, ST1−2, ST3, ST4, ..., STn−1.]</p>
      <p>This paper presents the S-PIC4CHU project, which addresses the challenge of building a
comprehensive and semantically-informed data preparation pipeline that systematically tackles various quality
issues across multiple data modalities. The proposed pipeline incorporates modules for data cleaning,
transformation, dimensionality reduction, deduplication, error detection, missing value imputation,
and space transformations for multimedia data. The key innovation lies in the integration of semantic
models and domain knowledge across the entire preparation workflow, which allows for more accurate
resolution of inconsistencies and a principled handling of uncertainty.</p>
      <p>We report on the initial advancements and design decisions resulting from the early development
phases of this approach. These include novel strategies for schema-driven imputation, semantics-aware
record linkage, and embedding-based error detection in multimedia datasets. We also outline the
forthcoming objectives of our work, which include the extension of the framework to support adaptive
preparation strategies and the evaluation of its effectiveness across diverse real-world domains.</p>
      <p>By laying the groundwork for a unified semantics-oriented data preparation framework, this work
contributes to the broader effort of making data science pipelines more transparent, automated, and
reliable — particularly in high-stakes settings where data quality directly impacts the analytical outcomes.</p>
      <p>The paper is organized as follows. Section 2 introduces a Semantic Transformation Pipeline that
enriches the traditional Data Preparation Pipeline. Section 3 discusses how ontologies can be profitably
used to deal with inconsistent and/or incomplete data. Section 4 builds on a conceptual model for
expressing contexts and ethical requirements, aimed at achieving a context-aware approach to be
adopted in the pipeline. Section 5 presents a provenance infrastructure for capturing and querying
detailed data lineage at the level of individual dataset elements. Section 6 discusses the use of ranking
techniques to retrieve results that are balanced with respect to user preferences and fairness, while also
offering insight into parallel execution. Finally, Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Semantics-Enriched Data Preparation</title>
      <p>The S-PIC4CHU project aims to deliver scalable, semantically-aware data preparation techniques that
improve the quality, traceability, and explainability of machine learning workflows. Central to this
vision is the development of a Semantic Transformation Pipeline (STP), which complements and enriches
the traditional Data Preparation Pipeline (DPP). The STP captures, at a semantic level, both the data and
the transformations occurring at each DPP stage that is conceptually significant. This is achieved by
linking each such DPP stage to a corresponding ontology through semantic mappings. The resulting
conceptual architecture is illustrated in Figure 1, where we have assumed that, e.g., D2 is a DPP stage
that is not conceptually significant, and therefore it is not mapped to a corresponding stage in the STP.
The mappings allow each STP stage to be viewed as a Virtual Knowledge Graph (VKG) over the data,
enabling powerful capabilities for provenance tracking, data quality management, and bias detection.</p>
      <p>A major challenge in realizing this vision has been the automation of ontology construction and
mapping derivation across pipeline stages. To address this, S-PIC4CHU is building on a catalog of
mapping patterns, inspired by well-established database design practices (e.g., primary-foreign key
structures, naming conventions, and schema normalization principles). These patterns provide reusable
templates for systematically transforming relational database schemas into semantically-rich ontological
representations, thereby aligning intermediate DPP stages to their semantic counterparts in the STP.</p>
      <p>
        A key milestone of the project has been the development and validation of LLM4VKG, a framework to
automate VKG construction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LLM4VKG leverages Large Language Models (LLMs) to operationalize
the pattern catalog and assist in two critical tasks:
• Mapping Suggestion: Given a relational schema and an initial target ontology, LLM4VKG uses
cataloged patterns to guide the LLM in proposing instance-level mappings between relational
elements and ontology concepts (both classes and properties), especially in the presence of naming
ambiguities or incomplete ontologies.
• Ontology Enrichment: When the initial ontology lacks relevant concepts to describe the data,
LLM4VKG employs the LLM to suggest ontology extensions that align with common design
patterns and the underlying data semantics.
      </p>
      <p>Importantly, the LLM does not learn the patterns; rather, its role is to align the ontology prescribed
by the curated pattern catalog with the target ontology. This hybrid design ensures both pattern fidelity
and flexibility in mapping generation.</p>
      <p>
        The effectiveness of LLM4VKG has been demonstrated through an extensive evaluation on the RODI
benchmark, a standard suite for testing relational-to-ontology mapping tools [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. LLM4VKG achieved
an average F1-score improvement of +17% over state-of-the-art baselines, with peak gains reaching
+39% in the most challenging scenarios. Moreover, LLM4VKG exhibited robustness to incomplete
ontologies and could handle complex mapping tasks that are representative of real-world DPP stages in
S-PIC4CHU.
      </p>
      <p>By integrating LLM4VKG into the S-PIC4CHU architecture, we have already advanced some key
objectives of the project:
1. semi-automatic ontology construction for STP stages;
2. derivation of semantic mappings from relational DPP stages using a pattern-based, LLM-assisted
approach;
3. support for the automated generation of provenance links and explanations, by formally
connecting intermediate and final data stages back to their semantic definitions and source data.</p>
      <p>These achievements lay the foundation for, and mark a significant step towards, delivering
open-source, semantically-driven tools for scalable data preparation in data science and machine learning
workflows, fully aligned with the goals of the S-PIC4CHU project. A key step that remains to be
explored in the project is the abstraction of data preparation operations into semantic transformation
steps within the STP.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Incompleteness and Inconsistency</title>
      <p>In the context of real-world data preparation, incompleteness and inconsistency are not exceptions
but the norm. Incompleteness in datasets arises when certain values or observations are missing,
partially recorded, or entirely unavailable, while inconsistency refers to data not complying with
constraints expressing the application semantics. In such settings, ontologies can provide valuable
knowledge to guide how incompleteness and inconsistency should be resolved in a meaningful way.
Our goal is to take into account different kinds of knowledge on the application domain expressed in
terms of user preferences, ontologies, data constraints, and data imputation rules (that is, rules that
guide how missing values are filled in).</p>
      <sec id="sec-3-1">
        <title>3.1. Ontology-enriched Data Imputation Rules</title>
        <p>
          An approach to deal with incomplete data is the imputation of missing data, which consists in replacing
missing values with concrete ones [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ]. This approach proves very useful when the subsequent
tasks involve data analytics or aggregate queries. Algorithms in this area can be divided into two main
categories: statistical algorithms and machine learning algorithms. However, current algorithms deal
only with raw datasets, that is, they ignore any additional knowledge that comes with them. We propose
missing data imputation techniques in the presence of ontologies, leveraging data imputation rules that
incorporate semantic knowledge on the application at hand to guide the process of replacing missing
values in a meaningful way.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Preferences to Resolve Data Inconsistencies</title>
        <p>
          Expressing preferences is natural and desirable in many applications, e.g., when one data source is more
reliable than another one, or when more recent facts are preferred over earlier ones. In the presence of
inconsistent data, preferences help refine the “consistent” information we can extract from inconsistent
data sources, ruling out undesirable results. When data sources are accompanied by ontologies providing
knowledge on the application domain, it is relevant to be able to express preferences on information
not directly available in the data, but derivable from it via the ontology. An additional important
aspect stems from the observation that, in the real world, preferences may not always hold, but may
depend on several underlying factors. Most often, users have different preferences under different
circumstances (think, e.g., of personalized e-commerce applications, where one’s preferences may
change based on location, time, weather, etc.). The inclusion of contextual preferences is particularly
useful in ontological settings [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], where part of the knowledge is not known in advance but may well
affect which preferences should be applied. We envisage a framework able to manage inconsistent
information under user preferences that incorporates the aforementioned features. Such goals require
introducing suitable formalisms to express preferences, balancing expressiveness and complexity, and
establishing their impact on query answering.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Ethics and Context-Awareness</title>
      <p>
        Ethics has become a major concern for the information management community, as both algorithms and
data should satisfy ethical rules guaranteeing that their use does not produce unethical behaviours.
However, we should also take into account that the ethical rules may vary according to the
situation (i.e., the context) in which the application programs must work. Therefore, we are working
on a bipartite conceptual model, composed of the Context Dimensions Tree (CDT) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a conceptual
model for describing the possible contexts, and the Ethical Requirements Tree (ERT), which describes
the ethical rules necessary to tailor and preprocess the datasets that should be fed to data analysis and
learning systems in each possible context. The results of this research are collected in a paper that is
currently under review.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Other Uses of Context-awareness</title>
        <p>We plan to apply context-awareness to other aspects of the data preparation pipeline, related to data
format and data quality. In this direction, we plan to use a similar framework beyond ethics, to include
data quality dimensions, both at a theoretical level and through the application to concrete case studies.
A first use case will be developed with the IMM Design Lab at PoliMI (http://www.immdesignlab.com/), aiming to support urban
policymakers using the Integrated Modification Methodology (IMM) [7], aligning with European Sustainable
Development Goals (SDGs; https://unric.org/it/agenda-2030/). The goal is to create open-source tools that span data acquisition, curation,
integration, and analysis, addressing semantic and spatial inconsistencies, and that serve as a testbed for the
S-PIC4CHU architecture, where context-aware ethical and quality requirements can be operationalized
and evaluated.</p>
        <p>When multimedia data are concerned, CDT and ERT can be effectively exploited to pre-process data
according to the task at hand. Indeed, the ubiquitous use of deep neural embedding models to describe
multimedia data produces representations consisting of hundreds, if not thousands, of dimensions.
However, it is common practice to select a reduced number of such dimensions (feature selection) or to
map representations to a subspace (feature engineering) with the goal of simplifying the next steps in
the pipeline and removing redundancy and/or irrelevance among embedding dimensions [8]. Arguably,
the choice of the actual feature selection/engineering technique to be used depends
on both the dataset at hand and the task to be performed (classification, retrieval, and so on), for example,
because a specific dimension can be relevant for a particular task and irrelevant for another one. In
this light, the goal of the data analytics pipeline represents the context that can guide the application
of different data-preparation algorithms on multimedia data embeddings, thus justifying the use of
CDT and ERT for multimedia data. In addition, the data analytics task can be enriched to use not only
the (appropriately pre-processed) embedded model, but also existing (orthogonal) semantic dimension
trees, so as to improve its effectiveness (to bridge the semantic gap) and efficiency (as an additional
filtering step) [9, 10].</p>
        <p>During the project, we plan to apply the CDT framework to a heterogeneous multimedia case study,
where different data modalities (e.g., images, video, and audio) and use cases will require
context-specific selection of relevant features and priorities. This will allow us to validate the CDT as a guiding
structure not only for ethical filtering but also for adapting the pipeline to domain-specific constraints
in multimodal data preparation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Data Provenance</title>
      <p>Reliable data-driven science relies heavily on data pipelines that transform raw inputs into machine
learning-ready datasets. Each transformation step may influence the outcome significantly, yet current
explainability research mostly focuses on models, not on how the data was shaped before training. This
is problematic, as preprocessing may inadvertently introduce bias or distort patterns, undermining trust
in results. Our goal is to enable a fine-grained understanding of how each preprocessing step impacts
the data, aligning with the S-PIC4CHU project’s focus on semantics, quality, and explainability.</p>
      <sec id="sec-5-1">
        <title>5.1. Data and Provenance Models</title>
        <p>To support this, we designed a provenance infrastructure for capturing and querying detailed data
lineage at the level of individual dataset elements [11]. Data is modeled as two-dimensional tabular
structures (dataframes), with features as columns and records as rows. Preprocessing operations are
categorized into four groups:
• Data reductions: These operations decrease the size of a dataset by eliminating rows (e.g., instance
selection) or columns (e.g., feature selection).
• Data augmentations: These operations increase the size of a dataset by adding rows (e.g., record
augmentation) or columns (e.g., feature augmentation).
• Data transformations: We define these as operations that modify existing elements in the dataset
without altering its overall size or schema (e.g., imputation, normalization, binarization).
• Data fusions: These operations combine two or more datasets (e.g., join, append).</p>
        <p>Provenance is captured using a graph model whose nodes, called Entities, denote
specific data elements, uniquely identified by dataset, row, and column. Activities represent
transformations, and the relationships among entities and activities are modeled using links such as wasGeneratedBy,
wasDerivedFrom, and wasInvalidatedBy. Each transformation generates a compact “provlet”
document per data element, which can be assembled into a full provenance graph.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Provenance Generation and Implementation</title>
        <p>Our solution observes data changes to infer provenance, without requiring internal access to
transformation logic, which is ideal for black-box or composite operations. The algorithm detects structural
and value changes between input and output dataframes, then applies PROV templates to model
dependencies. For example, one-hot encoding is captured as a vertical augmentation followed by a
projection.</p>
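        <p>As a minimal sketch of this observation-based idea, and not the project's actual implementation, the following Python compares an input and an output pandas dataframe and labels the change with the categories of Section 5.1. The helper name classify_change, the returned labels, and the rule of matching rows by index label and columns by name are assumptions made for illustration; data fusions and PROV templates are left out.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import pandas as pd


def classify_change(before: pd.DataFrame, after: pd.DataFrame) -> set:
    """Label the difference between two dataframes (hypothetical helper).

    Rows are matched by index label and columns by name; this matching
    rule is an assumption of the sketch.
    """
    labels = set()
    # Structural changes: rows or columns that disappeared or appeared.
    if len(before.index.difference(after.index)) or len(before.columns.difference(after.columns)):
        labels.add("reduction")
    if len(after.index.difference(before.index)) or len(after.columns.difference(before.columns)):
        labels.add("augmentation")
    # Value changes on the cells shared by input and output.
    rows = before.index.intersection(after.index)
    cols = before.columns.intersection(after.columns)
    old, new = before.loc[rows, cols], after.loc[rows, cols]
    # Two missing values at the same position count as unchanged.
    changed = ~((old == new) | (old.isna() & new.isna()))
    if changed.to_numpy().any():
        labels.add("transformation")
    return labels


raw = pd.DataFrame({"age": [30.0, None], "color": ["red", "blue"]})
imputed = raw.fillna({"age": 30.0})                # fills one missing cell
encoded = pd.get_dummies(raw, columns=["color"])   # one-hot encoding
```
]]></preformat>
        <p>In this sketch, the imputed dataframe is labeled as a transformation only, while the one-hot encoded dataframe is labeled as both an augmentation (new indicator columns) and a reduction (the original column is dropped), which mirrors the description of one-hot encoding given above.</p>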
        <p>We implemented this approach in Python using pandas for data handling and Neo4j as the graph
store [12]. Dataframes are wrapped using an Observer pattern to enable automatic provenance capture
during transformations. Expensive tasks are parallelized using multiprocessing. Efficient join tracking
is achieved with hash-based indexing to avoid costly scanning.</p>
        <p>A primary limitation of traditional data provenance systems is the excessive
volume of provenance data, which can severely complicate the readability and interpretability of the
provenance graph. We address this challenge by enabling customization of the level of
granularity at which provenance is collected and queried. This multi-granular view balances
flexibility, efficiency, and detail in provenance analysis.</p>
        <p>Finally, using a Retrieval-Augmented Generation approach, an LLM [13] is used to translate
user-defined natural language questions into executable queries over the collected data provenance graph [14].
The query results are then contextualized by the LLM to generate user-friendly, textual narratives over
the provenance.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation and Insights</title>
        <p>We validated our approach on both real and synthetic pipelines (e.g., German Credit, COMPAS, Census,
TPC-DI) [11]. The system supports a rich suite of provenance queries, including classic “Why,” “How,”
and “Why Not” questions, as well as new ones such as “All Transformations,” “Item History,”
and “Impact on Feature/Dataset Spread”. These help detect changes in data distribution that may affect
fairness or introduce bias.</p>
        <p>Performance-wise, overhead introduced by provenance capture is modest: about 1.4–1.8 seconds for
medium pipelines and under 30 seconds for larger ones. Complex operations like one-hot encoding
produce more entities, increasing overhead, but overall, the scalability is good. Simple graph queries
execute rapidly, while complex lineage traversals require more processing. A web-based interface
allows users to explore transformations and inspect before/after data states, aiding debugging and trust.</p>
        <p>In sum, our system enables detailed introspection of data pipelines by linking each output element
back to its origin. This supports bias detection, fairness auditing, and transparency, key aspects
of responsible AI. By tracing data derivation through every transformation, we empower users to
understand not just how models behave, but why, improving accountability and trust in data science
workflows.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Ranking</title>
      <p>In the context of the activities concerning data reduction, we are also focusing on the relationship
between top-k queries (aka ranking queries, based on scoring functions) and skyline queries (based on
Pareto-dominance), both aiming at selecting relevant objects. In particular, we have addressed dataset
partitioning strategies for the computation of such operators, the balance of the tuples in the retrieved
result set, and the fairness thereof.</p>
      <sec id="sec-6-1">
        <title>6.1. Partitioning Strategies for Computing the Skyline</title>
        <p>We have analyzed and experimented with alternative partitioning strategies for computing the skyline
of large datasets, which may turn out to be a challenging computational task [15]. Our results suggest
that even partitioning by using the values of a single attribute is highly effective in reducing costs,
in particular when coupled with a parallelization of the final phase in which the skyline is obtained
by combining the “local” skylines obtained in the different partitions. The work [16] shows that this
strategy also proves to be effective for the computation of operators, known as flexible skylines, that
extend the skyline and hybridize it with ranking queries [17, 18]. Another line of research regards the
vertical partitioning of data for the so-called middleware scenario, which resulted in classical algorithms
such as Fagin’s Algorithm [19] and the Threshold Algorithm [20]. These algorithms have been extended
and adapted to the case of flexible skylines [21], recently including the relevant scenario in which
indices for data access are not available (No Random Access scenario) [22].</p>
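        <p>The single-attribute partitioning strategy mentioned above can be sketched as follows. This is an illustrative, sequential Python version under stated assumptions, not the algorithm evaluated in [15]: lower values are assumed to be better on every attribute, the partitioning attribute is the first one, and the function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Pareto-dominance, assuming lower values are better on every attribute."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive quadratic skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points: Sequence[Point], parts: int) -> List[Point]:
    """Partition on the first attribute, compute local skylines, then merge."""
    ordered = sorted(points)  # tuples sort by their first attribute first
    size = max(1, -(-len(ordered) // parts))  # ceiling division
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Local skylines are independent of each other.
    local = [p for chunk in chunks for p in skyline(chunk)]
    # Final phase: combine the local skylines (sequential in this sketch).
    return skyline(local)


data = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 6), (6, 1), (7, 3), (8, 2)]
result = partitioned_skyline(data, parts=3)
```
]]></preformat>
        <p>The merge step is correct because a point in the global skyline is never dominated within its own partition, so it always survives as a local skyline point; the final pass only has to discard local winners that are dominated by points of other partitions.</p>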
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Directional Queries</title>
        <p>We have studied how effective top-k queries, in particular those using a linear scoring function, are
in retrieving non-dominated objects, i.e., those in the skyline [23, 24]. To this end, we introduced four
indicators to measure the difficulty of retrieving skyline points as well as their interestingness (some
of which even allow for a parallelized computation that exploits partitioning [25]). We observed that
relevant, yet hard to retrieve, objects occur in all the many real-world datasets we analyzed. A practical
way to circumvent this problem is represented by a novel type of scoring function, which yields the
so-called directional queries. Such queries, besides considering the score/utility of an object, also take into
account how the attribute values of an object are “balanced” with respect to the stated user preferences.
Experimental results obtained on both synthetic and real-world datasets demonstrate that directional
queries consistently outperform linear queries (as well as queries using some form of non-linearity) in
terms of cumulative recall, i.e., the fraction of skyline points that are retrieved by a set of top-k queries.
Along this line of research, we are also considering the problem of how to retrieve all skyline points
using a minimal number of top-k queries, which has practical interest for characterizing the difficulty
of implementing an effective data exploration process.</p>
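        <p>To make the notion of cumulative recall concrete, the sketch below computes it for a set of linear top-k queries on a toy dataset. It is only an illustration under stated assumptions: lower attribute values are assumed to be better, so a top-k query returns the k objects with the smallest weighted sum; the function names and the data are hypothetical; and the directional scoring function itself is not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Pareto-dominance, assuming lower values are better on every attribute."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k(points: Sequence[Point], weights: Sequence[float], k: int) -> List[Point]:
    """Linear top-k query: the k points with the smallest weighted sum."""
    return sorted(points, key=lambda p: sum(w * x for w, x in zip(weights, p)))[:k]


def cumulative_recall(points, queries, k) -> float:
    """Fraction of skyline points returned by at least one of the top-k queries."""
    sky = set(skyline(points))
    retrieved = set()
    for weights in queries:
        retrieved.update(top_k(points, weights, k))
    return len(sky & retrieved) / len(sky)


# (6, 6) is a balanced skyline point that no linear top-1 query returns here.
points = [(0, 10), (10, 0), (6, 6), (9, 9)]
queries = [(0.5, 0.5), (0.9, 0.1), (0.1, 0.9)]
recall_at_1 = cumulative_recall(points, queries, k=1)
recall_at_2 = cumulative_recall(points, queries, k=2)
```
]]></preformat>
        <p>In this toy dataset the balanced point (6, 6) belongs to the skyline but is missed by all three top-1 queries, so the cumulative recall is two thirds for k = 1 and reaches 1 only for k = 2. This is the kind of relevant yet hard-to-retrieve object discussed above.</p>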
        <p>Finally, we remark that favoring more balanced results is not in contrast with the requirement of
diversity of the tuples in the result set. Indeed, since the very notion of diversity is independent of the
family of scoring functions one is going to use, we plan to combine it with directional queries into an
integrated framework.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Fairness of a Top-k Set</title>
        <p>A prominent aspect of Ethics (see Section 4), which is also very relevant in the context of ranking
queries, is that of ensuring that the result of a top-k query also respects some fairness criteria. Since
there is an intrinsic trade-off between the overall utility of a set of objects and the fairness requirements
(maximizing the utility may lead to a biased result, and vice versa), we are now studying models and
algorithms for computing what we call the “UF-skyline” of a dataset, i.e., the skyline consisting of all
the k-sets of objects that are non-dominated with respect to the utility and fairness measures. The
advantage of this approach with respect to the solutions available in the literature is that it provides the
decision maker with the possibility of exploring all the possible trade-offs between utility and fairness
and, therefore, of making a more informed choice.</p>
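        <p>A brute-force sketch may help fix the idea of a skyline over sets rather than single objects. It is not the model or algorithm under study: the utility of a set is assumed to be the sum of its scores, the fairness measure (balance between two groups named a and b) is a hypothetical choice made only for this example, and enumerating all sets of size k is feasible only for tiny inputs.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations
from typing import List, Tuple

Item = Tuple[str, str, float]  # (identifier, group, utility score)


def utility(items) -> float:
    """Assumed utility of a set: the sum of the scores of its items."""
    return sum(score for _, _, score in items)


def fairness(items) -> float:
    """Hypothetical fairness measure: 0 when groups 'a' and 'b' are equally
    represented, more negative as the set becomes more unbalanced."""
    groups = [g for _, g, _ in items]
    return -abs(groups.count("a") - groups.count("b"))


def uf_skyline(items: List[Item], k: int) -> List[Tuple[Item, ...]]:
    """Brute force: k-sets not dominated on (utility, fairness), both maximized."""
    scored = [(utility(s), fairness(s), s) for s in combinations(items, k)]

    def dominated(u, f):
        return any(u2 >= u and f2 >= f and (u2 > u or f2 > f) for u2, f2, _ in scored)

    return [s for u, f, s in scored if not dominated(u, f)]


items = [("x1", "a", 9.0), ("x2", "a", 8.0), ("x3", "b", 5.0), ("x4", "b", 4.0)]
tradeoffs = uf_skyline(items, k=2)
```
]]></preformat>
        <p>On this toy input two sets survive: the one with the highest utility but no balance between groups, and the balanced one with the highest utility among balanced sets. Presenting both, rather than a single answer, is what lets the decision maker inspect the trade-off.</p>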
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this work, we presented the latest advancements of the S-PIC4CHU project, which proposes a
semantic-driven approach to data preparation addressing core challenges such as noise, incompleteness,
inconsistency, bias, and lack of explainability. By integrating semantic layers, ontologies, and contextual
knowledge into traditional pipelines, the project enhances the transparency, traceability, and adaptability
of data workflows. A central milestone is the development of LLM4VKG, a framework that leverages
LLMs to support the semi-automatic construction of VKGs and the derivation of semantic mappings.
Its integration enables scalable, ontology-aware, and explainable data transformations, outperforming
state-of-the-art tools.</p>
      <p>The project also highlights the importance of domain knowledge and user preferences in guiding
imputation and resolving inconsistencies, especially when ethical and contextual dimensions are
relevant. The inclusion of ethical requirements and the use of Context Dimension Trees (CDTs) ensure
alignment with societal values and application-specific constraints.</p>
      <p>In parallel, S-PIC4CHU delivers a fine-grained provenance system that captures and explains data
transformations at the element level. Successfully tested on benchmark datasets, it enables inspection
queries — such as transformation tracing and fairness impact assessment — while maintaining low
computational overhead. Built on this infrastructure, the project also explores fairness-aware ranking
techniques, such as directional queries and UF-skylines, which support data reduction decisions that
balance utility and equity.</p>
      <p>In conclusion, the project has already achieved important results, including the development of
LLM4VKG for ontology-aware mappings, initial frameworks for imputation and inconsistency resolution,
a provenance infrastructure for capturing and querying detailed data lineage at the level of individual
dataset elements, and the first evaluations of fairness-aware ranking techniques. As future work, we
plan to evaluate the entire pipeline — semantic enrichment, provenance tracking, and fairness-aware
ranking — within the domain of multimedia data. In particular, leveraging the CDT, we will tailor
the selection of relevant dimensions to specific applications, testing the robustness, fairness, and
transparency of the S-PIC4CHU architecture in multimodal settings.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the Italian Ministry of University and Research (MUR) PRIN 2022 grant
2022XERWK9 “S-PIC4CHU - Semantics-based Provenance, Integrity, and Curation for Consistent,
High-quality, and Unbiased data science”.</p>
      <p>This work was partially supported by the PNRR project FAIR - Future AI Research (PE00000013),
Spoke 9 - Green-aware AI, under the NRRP MUR program funded by the NextGenerationEU.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar
and spelling checks and for paraphrasing and rewording. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.
</p>
      <p>[7] T. Massimo, Integrated Modification Methodology (IMM): A phasing process for sustainable urban design, WASET World Academy of Science, Engineering and Technology 77 (2013).</p>
      <p>[8] H. Liu, Feature selection, in: C. Sammut, G. I. Webb (Eds.), Encyclopedia of Machine Learning, Springer, 2010, pp. 402–406. doi:10.1007/978-0-387-30164-8_306.</p>
      <p>[9] I. Bartolini, P. Ciaccia, Scenique: a multimodal image retrieval interface, in: S. Levialdi (Ed.), Proc. of the working conference on Advanced Visual Interfaces (AVI), ACM Press, 2008, pp. 476–477. doi:10.1145/1385569.1385664.</p>
      <p>[10] I. Bartolini, M. Patella, C. Romani, SHIATSU: tagging and retrieving videos without worries, Multim. Tools Appl. 63 (2013) 357–385. doi:10.1007/S11042-011-0948-1.</p>
      <p>[11] A. Chapman, L. Lauro, P. Missier, R. Torlone, Supporting better insights of data science pipelines with fine-grained provenance, ACM Trans. Database Syst. 49 (2024) 6:1–6:42. doi:10.1145/3644385.</p>
      <p>[12] P. L. Lazzaro, M. Lazzaro, P. Missier, R. Torlone, PROLIT: Supporting the transparency of data preparation pipelines through narratives over data provenance, in: Proc. of the Int. Conf. on Extending Database Technology (EDBT), OpenProceedings.org, 2025, pp. 1138–1141.</p>
      <p>[13] A. Matarazzo, R. Torlone, A survey on large language models with some insights on their capabilities and limitations, CoRR abs/2501.04040 (2025). doi:10.48550/arXiv.2501.04040. arXiv:2501.04040.</p>
      <p>[14] L. Gregori, P. L. Lazzaro, M. Lazzaro, P. Missier, R. Torlone, An LLM-guided platform for multigranular collection and management of data provenance, J. Big Data (2025). To appear.</p>
      <p>[15] P. Ciaccia, D. Martinenghi, Optimization strategies for parallel computation of skylines, Distributed and Parallel Databases (2025). To appear.</p>
      <p>[16] E. D. Lorenzis, D. Martinenghi, Partitioning Strategies for Parallel Computation of Flexible Skylines, Algorithms 18 (2025). doi:10.3390/a18030141.</p>
      <p>[17] P. Ciaccia, D. Martinenghi, Reconciling skyline and ranking queries, Proc. of the VLDB Endowment 10 (2017) 1454–1465.</p>
      <p>[18] P. Ciaccia, D. Martinenghi, Flexible skylines: Dominance for arbitrary sets of monotone functions, ACM Transactions on Database Systems 45 (2020) 18:1–18:45. doi:10.1145/3406113.</p>
      <p>[19] R. Fagin, Combining fuzzy information from multiple systems, in: Proc. of the ACM Symp. on Principles of Database Systems, 1996, pp. 216–226. doi:10.1145/237661.237715.</p>
      <p>[20] R. Fagin, A. Lotem, M. Naor, Optimal aggregation algorithms for middleware, in: Proc. of the ACM Symp. on Principles of Database Systems, 2001. doi:10.1145/375551.375567.</p>
      <p>[21] P. Ciaccia, D. Martinenghi, FA + TA &lt; FSA: Flexible score aggregation, in: Proc. of the ACM Int. Conf. on Information and Knowledge Management (CIKM), 2018, pp. 57–66. doi:10.1145/3269206.3271753.</p>
      <p>[22] D. Martinenghi, Computing non-dominated flexible skylines in vertically distributed datasets with no random access, Data 10 (2025). doi:10.3390/data10050076.</p>
      <p>[23] P. Ciaccia, D. Martinenghi, Directional Queries: Making Top-k Queries More Effective in Discovering Relevant Results, Proc. ACM Manag. Data 2 (2024). doi:10.1145/3698807.</p>
      <p>[24] P. Ciaccia, D. Martinenghi, Relevant, yet hard to find: Directional queries to the rescue, in: Proc. of the Italian Symposium on Advanced Database Systems, CEUR Workshop Proceedings, CEUR-WS.org, 2025. To appear.</p>
      <p>[25] D. Martinenghi, Parallelizing the Computation of Grid Resistance to Measure the Strength of Skyline Tuples, Algorithms 18 (2025). doi:10.3390/a18010029.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Panfilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lanti</surname>
          </string-name>
          ,
          <article-title>LLM4VKG: Leveraging large language models for virtual knowledge graph construction</article-title>
          ,
          <source>in: Proc. of the Int. Joint Conf. on Artificial Intelligence</source>
          ,
          <year>2025</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pinkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Sasa</given-names>
            <surname>Bastinos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Skjaeveland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solimando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taheriyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heupel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <article-title>RODI: Benchmarking relational-to-ontology mapping generation quality</article-title>
          ,
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <fpage>25</fpage>
          -
          <lpage>52</lpage>
          . doi:10.3233/SW-170268.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shahbazian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Trubitsyna</surname>
          </string-name>
          ,
          <article-title>DEGAIN: Generative-adversarial-network-based missing data imputation</article-title>
          ,
          <source>Information</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>575</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shahbazian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <article-title>Generative adversarial networks assist missing data imputation: A comprehensive survey and evaluation</article-title>
          ,
          <source>IEEE Access</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>88908</fpage>
          -
          <lpage>88928</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Calautti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Molinaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Trubitsyna</surname>
          </string-name>
          ,
          <article-title>Preference-based inconsistency-tolerant query answering under existential rules</article-title>
          ,
          <source>in: Proc. of the Int. Conf. on Principles of Knowledge Representation and Reasoning</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bolchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Quintarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          ,
          <article-title>CARVE: Context-aware automatic view definition over relational databases</article-title>
          ,
          <source>Information Systems</source>
          <volume>38</volume>
          (
          <year>2013</year>
          )
          <fpage>45</fpage>
          -
          <lpage>67</lpage>
          . doi:10.1016/J.IS.2012.05.004.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>