<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Data Management and Value Creation through a Knowledge-centric Data Stack</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Enrico Franconi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Théo Abgrall</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Employee(eoid</institution>
          ,
          <addr-line>ssn,name), works-in(eoid,doid), Department(doid,depname,address) Employee.eoid ⇄ Employee.ssn Employee.ssn</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KRDB Research Centre for Knowledge-based Artificial Intelligence, Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Italy - https://krdb.eu</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Nowadays we observe an evolving landscape of data management and analytics, emphasising the significance of meticulous data management practices, semantic modelling, and bridging business-technical divides, to optimise data utilisation and enhance value from datasets in modern data environments. In this paper we introduce and explain the basic formalisation of the Semantic SQL Transducer, a well-founded but practical tool providing the materialised lossless conceptual view of arbitrary relational source data, contributing to a knowledge-centric data stack.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The landscape of data management and analytics is undergoing continuous evolution, aiming to
optimise data utilisation, ensure governance, and enhance the value derived from the datasets.
Several pivotal concepts shape an enhanced modern data environment, emphasising the
significance of robust data preparation, semantic modelling, and bridging the gap between technical
and business perspectives. By looking at the current trends in data management, key aspects
include the increasing significance of metadata management for data governance, the necessity
of comprehensive semantic enrichment in data contracts and data preparation, the importance of
bridging the divide between business problem models and data domains through the integration
of semantic mediation, the adoption of a semantic-based declarative transformation process,
and the facilitation of seamless data integration and improved interoperability through shared
semantic understanding.</p>
      <p>
        Auditability and Data Governance. Auditability stands as a critical factor in aiding data
analysts to comprehend and model schemas effectively. The emphasis on properly managing
metadata supports effective data governance by encompassing the meticulous organisation,
quality control, and management of key properties like completeness, consistency, fairness,
privacy, provenance, and other data qualities. The focus on enhancing data analyst comprehension
and schema modelling is pivotal, highlighting the paramount importance of serious metadata
management [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Data Preparation. Data preparation serves as the foundational pillar upon which successful
analytics stands. Tasks encompassed within data prep, such as cleaning, parsing, integrity
checks, and data set unification, are crucial. Much like the significance of a solid foundation
for monumental structures, a meticulous data preparation phase is indispensable for accurate
visualisation, reporting, and forecasting, and it is the step where raw data transforms into
actionable insights. Unluckily, the tools that vendors propose for data preparations do not
include a serious role of semantics in the pipeline, but only some shallow semantic enrichment
such as semantic data types [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Shifting Perspectives on Data Utilisation. Historically, the predominant focus in the data
domain has been on downstream tasks like visualisation and reporting. However, contemporary
trends reveal a substantial shift, where data preparation and meticulous data management
consume a significant portion of analysts’ time. Modern tools provide a bridge to streamline
these preparation-intensive tasks, minimising manual labour and enhancing efficiency [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Semantic Modelling and Data-Centric Approaches. The traditional approach to data
warehousing often resulted in tangled data sets, complicating queries and hindering data
trustworthiness. Embracing semantic modelling involves steps like knowledge graph creation,
engineering ownership, data contracts, and subsequent event implementation. Such
methodologies ensure that data scientists and engineers spend their time defining and utilising high-quality
data, contributing to a more robust and efficient data warehouse. In this context, semantic
enhancement plays a critical role in elevating the value and usability of data within industrial
contexts. Semantic enhancement isn’t merely about structuring data but imbuing it with rich
context, relationships, and meaning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Data Catalogs. Classical data catalogs and data contracts often fall short in capturing the
meaning of the business domain of the companies they are meant to represent. To effectively
serve users, catalogs must act as bridges between the user’s problem model and the underlying
data domain by incorporating semantic understanding. This involves creating ontologies,
structured representations of knowledge that enhance the comprehension of data relationships.
When business users attempt to interpret data, they often lack the semantic context necessary
for a full understanding. They communicate in their familiar business language, while data
appears as a new language tied to the source structures. Here’s where the semantic layer
comes into play. This critical component adds a layer of business context to data stored in
catalogs, essentially translating it into a language that the business can easily comprehend. The
semantic layer also ensures consistency across diverse and heterogeneous data by implementing
a common business logic. This conformity empowers businesses to derive meaning from the
data coming out of the data stack pipeline [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Challenges and Solutions in Modern ETL. Bill Inmon highlights the evolution of ETL
(Extract, Transform, Load) and ELT (Extract, Load, Transform) processes, noting a tendency
among vendors and consultants to shift from ETL to ELT. As a consequence, ELT often places
the burden of transformation on others, delaying the essential data transformation step. “The
upshot is – if you wanted to just move some data around, then ELT is your thing. But if you
want believable data, then you have to do ETL. So the choice is yours. Do you want data quickly
and easily, that may be essentially unreliable? Or do you want data that forms a firm foundation
for – AI, analytics, data mesh, ML, et al?”. While modern ETL practices have expedited data
processing, their procedural nature and lack of a systematic analysis of the data transformation
process have also brought challenges like incomplete or irrelevant data, tight coupling between
services and analytics, and a surge in technical debt. There is a need for a solution involving
a return to upfront semantic data definition and design driving a declarative transformation
process.</p>
      <p>The Importance of Business Literacy. Bridging the gap between business and technical
perspectives emerges as a crucial necessity. Envisioning and developing semantic warehouses
that facilitate enriched communication between business leaders and data practitioners through
context-rich data representations serve as the cornerstone for unified and comprehensible
data landscapes. As Juan Sequeda says, data modelling without working with subject matter
experts/end users is simply a recipe for disaster.</p>
      <p>
        Unifying Ontologies for Data Integration. The integration of diverse metadata within
enterprises requires a systematic approach. Semantic modelling supports the harmonisation of
disparate data elements within an organisation. A singular, adaptable, and shareable ontology
serves as a linchpin, unifying complex data landscapes, fostering cohesive data management
strategies, enabling seamless data integration and enhancing interoperability [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>A Pragmatic Approach. Dave McComb calls for a pragmatic approach to semantic
enhancement, balancing the theoretical underpinnings of ontology development with practical
implementations tailored to industrial settings. He advocates for methodologies that facilitate
the creation of ontologies that are adaptable, shareable, and capable of evolving alongside
dynamic data landscapes.</p>
      <p>
        The Boring Data Stack. Joe Reis introduces the concept of the “Boring Data Stack”, signalling
a shift in focus from managing underlying technologies to addressing critical yet often neglected
aspects like data governance and semantic modelling [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. He emphasises the importance of
these “boring" practices, attributing their significance to the era of AI and ML advancements.
He points out the need for organisations to address data quality issues, especially in the context
of AI and ML applications, where messy datasets pose significant challenges. Reis envisions
a future where data-centric approaches evolve into knowledge-centric ones, emphasising the
necessity of robust data governance, management, and semantic modelling practices to achieve
this transition effectively. He stresses the criticality of conceptual and logical data modelling in
aligning data with the realities of business operations, cautioning against the prevalent tendency
to solely focus on physical data models disconnected from business needs.
      </p>
      <p>Given this context, we believe a contribution to support proper semantic modelling within
a data preparation pipeline is badly needed. In this paper we introduce and explain the basic
formalisation of the Semantic SQL Transducer, a well-founded but practical tool providing the
materialised lossless and possibly conceptual view of arbitrary relational data, contributing
to a knowledge-centric data stack. The Semantic SQL Transducer can be seen as a seamless
semantic wrapper around arbitrary relational data at any stage of the data stack, independently
of its architecture. The advantage of this technology is that it can be seen as a replacement
of the data it models, providing a restructuring of the data according to its restructured (and
possibly conceptual) model as a standard SQL database, which can therefore be queried,
updated, and transformed. It can also be used to replace procedural data transformation tasks with
semantic-based declarative executable specification of the transformation task, guaranteeing
the losslessness of the transformation itself. By restructuring the relational data directly in its
conceptual model, the transducer provides conceptual access to the data, giving business
users and data analysts the right understanding of the available data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The role of a Semantic SQL Transducer</title>
      <p>Our contribution to a knowledge-centric data stack consists of a SQL-based tool (supported
by a design methodology) providing the materialised lossless conceptual view of an arbitrary
relational source data. Such materialised conceptual view can be queried and updated using
the knowledge vocabulary, with virtually no overhead with respect to the original source data.
Updates to the materialised conceptual view are replicated instantaneously to the source data
(e.g., to push a semantic-conscious data cleaning update to the source), and updates to the
source data are replicated instantaneously to the materialised conceptual view, always providing
a fresh view of the source. The materialised view is the conceptual lossless mirror of the source
data, and it acts as a mediator by providing the conceptual API for complete access and change
to the source data.</p>
      <p>Our Semantic SQL Transducer is based completely on standard SQL technology, it can be
deployed on any SQL platform, and it does not require any additional tool or code to work. The
transducer can exactly translate legacy SQL queries and updates over the source schema to
SQL queries and updates over the conceptual schema, and it can exactly translate analytical
SQL queries and updates over the conceptual schema to SQL queries and updates over the
source schema. By using semantic SQL transducers within a data prep process as semantic
wrappers around the data sources (e.g., see Figure 1), the “transform” part of the ETL process
can operate over semantically well defined entities and relationships. The transducer supports
transactions, to systematically guarantee semantic integrity and consistency of both the source
and its conceptual model.</p>
      <p>
        In order to represent the exact semantics of the source data, the Semantic SQL Transducer
supports several popular conceptual data models: ERD, ORM, UML Class Diagrams, Property
Graphs Schemas, Knowledge Graphs. It is based on several years of theoretical research by us
on the formalisation of the connection between conceptual data models and relational databases,
and on the formalisation of core SQL [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref8 ref9">8, 9, 10, 11, 12</xref>
        ]. A rigorous methodology to properly design
a semantic SQL transducer given the data sources has been studied and experimented with [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ];
we are developing several tools to support it, which are not yet publicly available.
      </p>
      <p>We believe that our pragmatic approach to semantic enrichment provides a useful
knowledge-based core element which can be embedded within many different data architectures,
improving on the issues emphasised above in the direction of making the boring data stack more
exciting: auditing and data governance are now based on a clear semantic view of the data,
supporting a more transparent environment to exploit business literacy; during data preparation,
semantic integrity checks can be defined over the conceptual structures, and data cleaning can be
enforced at the level of the conceptual view; data integration and entity recognition are now
semantic-based, and the presence of a unified conceptual model exactly capturing the diverse data
sources supports the harmonisation of the disparate data elements within the enterprise; the gap
between data management and data analysis is reduced, being mediated by the transducers,
therefore reducing the risk of technical debt; data scientists operate over high-quality data,
contributing to a more robust and efficient data warehouse.</p>
      <p>Note that there are many semantic-based approaches that introduce an intermediate layer
between the data layer and the business layer, with which our proposal should be compared, such
as enterprise data fabrics, data meshes, data lakes, etc. The proposed Semantic SQL Transducer
is orthogonal to any of such architectural choices. The suggested semantic data stack in
Figure 1 above has the sole purpose of showing how the Semantic SQL Transducer could support
semantic enrichment in a modern data stack. The uniqueness of our proposal lies in the fact that it can
losslessly "present" the data in a restructured way, possibly according to its conceptual schema,
always using SQL as the foundational formalism.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Inside the Semantic SQL Transducer</title>
      <p>The abstract internal architecture of the Semantic SQL Transducer is shown in Figure 2. It
is a generic architecture, implementing the lossless bidirectional interoperation between two
databases, called S (source) and T (target). The SQL code guarantees that the two databases
are always automatically synchronised after any update (wrapped within a transaction) to
any of the two databases. The two databases maintain their original constraints and indexes,
therefore maximising the efficiency of querying. The updates to a database are recast directly
as actual updates to the other database, therefore maximising the efficiency of updates. Some
attention has to be paid to avoid infinite looping of the triggers.</p>
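      <p>As a minimal illustrative sketch (and not the actual code generated by the tool), the following PostgreSQL-flavoured fragment shows how such a trigger pair can propagate an insert from a source table to a target table while stopping the symmetric trigger from bouncing the row back; the tables and the pg_trigger_depth() guard are invented for the example.</p>
      <p>create table S1 (ssn text primary key, name text);   -- a source table
create table T1 (ssn text primary key, name text);   -- its target counterpart
create function s1_to_t1() returns trigger as $$
begin
  -- depth 1 means a user statement; a cascaded firing (depth 2) is not
  -- re-propagated, which avoids the infinite loop between the two triggers
  if pg_trigger_depth() = 1 then
    insert into T1 values (new.ssn, new.name);
  end if;
  return new;
end;
$$ language plpgsql;
create trigger S1_INSERT after insert on S1
  for each row execute function s1_to_t1();
-- a symmetric function and trigger on T1 complete the bidirectional synchronisation</p>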
      <p>
        The real complexity comes in defining both the lossless mappings from S to T and from T to S
(appearing in the SQL code as create table X as select ...) and the constraints of the
two databases S and T. Those mappings and constraints are provided by our theory of lossless
transformations [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14 ref9">9, 14, 11, 13, 12</xref>
        ], based on the original works on information capacity [
        <xref ref-type="bibr" rid="ref15">15, 16,
17, 18, 19, 20</xref>
        ].
      </p>
      <p>When the transducer is used to provide the semantic enrichment as a conceptual view of a data
source within an ETL pipeline, the database S is indeed the source database, and the database T
is the dynamic restructuring of S according to its conceptual schema. The involved mappings
and constraints are a generalisation of classical mappings studied in the reverse engineering
literature [21, 22, 23, 24, 25, 26], started by the seminal papers by Hainaut [27, 28, 29].</p>
      <p>Source S:
create table S1 ...                                  (with source constraints)
...
create table Sm ...                                  (with source constraints)</p>
      <p>Target T:
create table T1 ...                                  (with target constraints)
...
create table Tn ...                                  (with target constraints)</p>
      <p>Insert S → T:
create table S1_INSERT ...                           (same schema as S1)
...
create table Sm_INSERT ...                           (same schema as Sm)
create trigger S1_INSERT after insert on S1 ...      (updates S1_INSERT)
...
create trigger Sm_INSERT after insert on Sm ...      (updates Sm_INSERT)
create table T1_INSERT as select ... from S1_INSERT,...,Sm_INSERT ...
...
create table Tn_INSERT as select ... from S1_INSERT,...,Sm_INSERT ...
create trigger T1_INSERT after insert on T1_INSERT ... (updates T1)
...
create trigger Tn_INSERT after insert on Tn_INSERT ... (updates Tn)</p>
      <p>Insert T → S:
create table T1_INSERT ...                           (same schema as T1)
...
create table Tn_INSERT ...                           (same schema as Tn)
create trigger T1_INSERT after insert on T1 ...      (updates T1_INSERT)
...
create trigger Tn_INSERT after insert on Tn ...      (updates Tn_INSERT)
create table S1_INSERT as select ... from T1_INSERT,...,Tn_INSERT ...
...
create table Sm_INSERT as select ... from T1_INSERT,...,Tn_INSERT ...
create trigger S1_INSERT after insert on S1_INSERT ... (updates S1)
...
create trigger Sm_INSERT after insert on Sm_INSERT ... (updates Sm)</p>
      <p>Similarly: Delete S → T, Delete T → S.</p>
      <p>
We can show that every source database is associated with a unique canonical abstract relational
model [30], which is the lossless materialisation of the database in its conceptual schema in
6th normal form. The canonical abstract relational model has a direct correspondence with
the most popular conceptual modelling languages such as ERD, ORM, UML Class Diagrams,
Property Graphs Schemas, RDF-based models. We provide a rigorous methodology to properly
design the conceptual schema in the form of a canonical abstract relational model, given the
data sources [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]; we are developing several tools to support it. In the next Section we will
define the notion of losslessness in database restructuring, and we will explain all the basic
transformation steps of the design methodology through a completely worked out example.
      </p>
      <p>In conclusion, the Semantic SQL Transducer provides seamless access to databases through
their conceptual schemas, and it contributes to a knowledge-centric data stack by adding a
declarative semantic layer to databases.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Designing the Semantic SQL Transducer</title>
      <p>In order to formalise the SQL transducer, we first introduce the notion of a first-order database
schema. A first-order database schema S is a pair ⟨A, C⟩ where A is a set of database
predicates with their attributes (A1, A2, · · · , An) – for simplicity we do not consider here the
domains of the attributes – and C is a set of first-order constraints (aka dependencies) over
the predicates. In order to capture exactly SQL, we restrict constraints to be in the
domain-independent fragment of first-order logic: all interesting kinds of constraints can be represented,
ranging from functional and multivalued dependencies (including keys), to inclusion
dependencies (including foreign keys), to constraints on domain values. We will use a standard notation
for classical database dependencies, most notably, A1, · · · , An → B1, · · · , Bm for functional
dependencies, A1, · · · , An ↠ B1, · · · , Bm for multivalued dependencies, A1, · · · , An ⊆ B1, · · · , Bm
for inclusion dependencies, and A ⊆ {“v1”, · · · , “vk”} for domain constraints. We will write
general dependencies using relational algebra.</p>
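      <p>For concreteness, here is a hedged SQL sketch (with invented table and attribute names, anticipating the example used later) of how such constraints surface in a SQL schema: functional dependencies as keys, inclusion dependencies as foreign keys, and domain constraints as check constraints.</p>
      <p>create table Department (
  depname varchar primary key,                     -- Department.depname → Department.address
  address varchar not null
);
create table Employee (
  ssn varchar primary key,                         -- Employee.ssn → Employee.name
  name varchar not null,
  emptype varchar check (emptype in ('permanent', 'temporary'))  -- domain constraint (invented attribute)
);
create table works_in (                            -- SQL rendering of works-in
  ssn varchar references Employee(ssn),            -- works-in.ssn ⊆ Employee.ssn
  depname varchar references Department(depname)   -- works-in.depname ⊆ Department.depname
);</p>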
      <sec id="sec-4-1">
        <title>Lossless transformations.</title>
        <p>Suppose that I(S) and I(T) are the sets of legal database instances (or models) for schemas S
and T respectively: following [16, 17], a schema transformation from S to T is a total mapping
function f_{S→T} : I(S) → I(T). In order to define a lossless transformation, we need to introduce
first the notion of schema dominance [16, 17].</p>
        <p>Given two schemas S and T, T dominates S if there is a total and injective mapping function
f_{S→T} : I(S) → I(T) which maps legal database instances of S to legal database instances of
T. Equivalently, we can say that T dominates S if two mapping functions f_{S→T} :
I(S) → I(T) and f_{T→S} : I(T) → I(S) exist, such that their composition f_{T→S} ∘ f_{S→T} (the
result of applying f_{T→S} after applying f_{S→T}) is the identity on I(S).</p>
        <p>Two schemas S and T are equivalent, written S ≡ T, if and only if S dominates T and
T dominates S. When two schemas S and T are equivalent, the mappings f_{S→T} and f_{T→S}
are bijective, and we say that both schemas have the same information capacity and that the
transformation is lossless.</p>
        <p>In our setting, we consider mappings as first-order views establishing the relation between
two database schemas. More precisely, given two schemas S and T, a first-order mapping from
S to T, written M_{S→T}, is a set of first-order views P = E, one for each predicate P of arity n
in A_T, with E a relational algebra expression of arity n over the alphabet A_S.</p>
        <p>
          In the first-order setting, it can be proved that the definition of equivalence (or lossless
transformation) between S and T corresponds to the following condition over the schemas and
mappings: (C_S ∪ M_{S→T}) ≡ (C_T ∪ M_{T→S}), which really means I(⟨A_S ∪ A_T, C_S ∪ M_{S→T}⟩) =
I(⟨A_S ∪ A_T, C_T ∪ M_{T→S}⟩); see [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Transformation patterns.</title>
        <p>
          Transformation patterns are crafted templates describing a specific structure of schema
transformation with the constraints necessary to ensure its losslessness [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The two basic lossless
transformation patterns are vertical decomposition and horizontal decomposition. We introduce
them via two basic examples.
        </p>
        <p>Given two schemas S and T as follows:
S = ⟨{R(A, B, C)}, {(R.A → R.B)}⟩
T = ⟨{R1(A, B), R2(A, C)}, {(R1.A → R1.B), (R1.A = R2.A)}⟩
The schemas S and T have the same information capacity since there is a lossless transformation
through the following mappings – characterising the vertical decomposition in the classical
database literature:
M_{S→T} = {(R1 = π_{A,B} R), (R2 = π_{A,C} R)}
M_{T→S} = {(R = R1 ◁▷ R2)}
The vertical decomposition transformation pattern maps a schema with a join dependency (i.e.,
a key dependency, or a functional dependency, or a multivalued dependency) to its vertical
decomposition, with all the appropriate dependencies in both schemas to guarantee losslessness.</p>
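        <p>In SQL terms, the two mappings of this vertical decomposition are plain views; a sketch with the relation names of the example (the datatypes are our arbitrary choice):</p>
        <p>create table R (A int, B text, C int);           -- schema S; R.A → R.B is assumed to hold on the data
create view R1 as select distinct A, B from R;   -- M_{S→T}: R1 = π_{A,B} R
create view R2 as select distinct A, C from R;   -- M_{S→T}: R2 = π_{A,C} R
create view R_back as                            -- M_{T→S}: R = R1 ◁▷ R2
  select R1.A, R1.B, R2.C                        -- the join is lossless because R.A → R.B holds
  from R1 join R2 on R1.A = R2.A;</p>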
        <p>As an example of a lossless horizontal decomposition transformation, consider the schema V,
obtained from the schema T above by decomposing R1 via the condition B = “b”:
V = ⟨{R11(A, B), R12(A, B), R2(A, C)}, {(R11.A → R11.B), (R12.A → R12.B),
(R11.B = {“b”}), (R12.B ⊈ {“b”}),
(π_A R2 = π_A R11 ∪ π_A R12), (π_A R11 ∩ π_A R12 = ∅)}⟩
The schemas T and V have the same information capacity since there is a lossless transformation
through the following mappings – characterising the horizontal decomposition via the condition
B = “b”:
M_{T→V} = {(R11 = σ_{B=“b”} R1), (R12 = σ_{B≠“b”} R1), (R2 = R2)}
M_{V→T} = {(R1 = R11 ∪ R12), (R2 = R2)}
The horizontal decomposition transformation pattern maps a schema to a horizontally
decomposed one via a selection condition, with all the appropriate dependencies in both schemas to
guarantee losslessness.</p>
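        <p>Again as a SQL sketch, instantiating the constant “b” with a string of our choice, the horizontal decomposition mappings become selections and a disjoint union:</p>
        <p>create view R11 as select * from R1 where B = 'b';     -- σ_{B=“b”} R1
create view R12 as select * from R1 where B &lt;&gt; 'b';    -- σ_{B≠“b”} R1
create view R1_back as                                  -- R1 = R11 ∪ R12
  select * from R11
  union all                                             -- the two branches are disjoint by construction
  select * from R12;</p>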
        <p>We can also observe that the schemas S and V have the same information capacity, since they
are related by a sequence of lossless transformations.</p>
        <p>A special case of horizontal decomposition is the lossless transformation leading to a SQL
NULL-free schema. According to the logic theory of SQL NULL values [31, 32], a schema has
the same information capacity as a horizontally decomposed one via a NULLABLE condition
over some attribute. Whenever there is a NULLABLE constraint over an attribute, a table can
be losslessly decomposed into two tables, one having all the attributes but not the NULLABLE
one, and the other having all the attributes but with a NOT NULL constraint replacing the
NULLABLE constraint.</p>
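        <p>The following sketch (an invented table, in the same SQL flavour as the previous fragments) shows the NULL-free decomposition of a table with a NULLABLE attribute, together with its lossless inverse:</p>
        <p>create table PersonN (ssn text primary key, phone text);   -- phone is NULLABLE
create view Person_nophone as                               -- all attributes but phone
  select ssn from PersonN;
create view Person_phone as                                 -- phone is NOT NULL here
  select ssn, phone from PersonN where phone is not null;
create view PersonN_back as                                 -- lossless reconstruction
  select a.ssn, p.phone
  from Person_nophone a left join Person_phone p on p.ssn = a.ssn;</p>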
        <p>
          We have identified several lossless transformation patterns [
          <xref ref-type="bibr" rid="ref12 ref14">14, 12</xref>
          ], which can be used
to design a Semantic SQL Transducer allowing for arbitrary data restructuring processes,
whenever we want to guarantee that no information is lost during the restructuring process.
The transformation patterns identify the lossless mappings from S to T and from T to S and the
constraints of the two databases S and T, needed to design a correctly working Semantic SQL
Transducer, as described in Section 3.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Reverse Engineering as Semantic Layer</title>
        <p>A very special data restructuring process is the reverse engineering process, which looks for
the lossless transformation from a source database schema to the schema corresponding to its
conceptual schema – see [27] for a survey. This is the scenario we have presented in Section 2:
we want to expose the source data with a vocabulary that corresponds to its conceptual schema,
useful for the business perspective. If the transformation is lossless, we have the guarantee that
no information is lost, and that high-level users can query and update freely the transformed
database, in this case the database organised in a meaningful structure. More specifically, we
want to losslessly transform a source schema into a schema in 6th normal form with explicit
Object Identifiers (OIDs) to identify “entities”, namely instances of entity types. Object identifiers
can be implemented by surrogate keys, URIs, or UUIDs. This form is called an Abstract Relational
Model (ARM) [33]. Given an arbitrary database schema, there exists a unique Canonical Abstract
Relational Model (CARM) for that schema, based on the 5th or 6th normal forms, which plays the
role of the Core Conceptual Schema of the original database expressible in conceptual modelling
languages such as ORM, EER, UML class diagrams, or in RDF-based modelling languages. In
order to show losslessness, we consider the set of legal database instances of the CARM schema
by projecting away the OID attributes of entities; as a consequence, queries or updates over the
CARM schema can never explicitly ask or update OID values.</p>
        <p>In order to understand how to losslessly transform a database schema into an equivalent one
(the CARM) which includes OIDs, let’s consider the following basic example. Assume we have
a schema in 5th normal form, so that the only constraints within a table are key constraints,
and the constraints across tables are foreign keys or inverse foreign keys, for example:
Employee(ssn,name), works-in(ssn,depname), Department(depname,address)
Employee.ssn → Employee.name
works-in.ssn ⊆ Employee.ssn
works-in.depname ⊆ Department.depname
Department.depname → Department.address</p>
        <p>A domain expert should recognise that Employee and Department are entity types, while
works-in is a relationship type. As a rule of thumb, we can recognise entity types since they should
be the target of at least a foreign key with a relationship type as source, while a relationship
type should have at least one attribute as the source of a foreign key with an entity type as
target. A new attribute with domain OID (disjoint from STRING and INTEGER) is added as
a surrogate key to each entity type table, and coherently a new OID attribute replaces the
attributes involved in a foreign key path from the entity type. The foreign key and inverse
foreign key constraints holding across tables are duplicated to hold between the added OID
attributes. Following our example, the lossless transformation of the above schema with added
OIDs is:</p>
        <p>Employee(eoid,ssn,name), works-in(eoid,doid), Department(doid,depname,address)
Employee.eoid ⇄ Employee.ssn
Employee.ssn → Employee.name
works-in.eoid ⊆ Employee.eoid
works-in.doid ⊆ Department.doid
Department.doid ⇄ Department.depname
Department.depname → Department.address
The schema above is the logical representation corresponding to the Entity-Relationship Diagram
conceptual schema of Figure 3.</p>
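        <p>As an illustration of one possible implementation choice (not the only one: URIs or UUIDs would also do), the OIDs can be materialised as surrogate keys computed from the natural keys, so that the transformation stays deterministic and invertible; the sketch below reuses the SQL rendering of the example tables given earlier (works_in for works-in):</p>
        <p>create view Employee_oid as                        -- eoid implemented from the key ssn
  select ssn as eoid, ssn, name from Employee;
create view Department_oid as                      -- doid implemented from the key depname
  select depname as doid, depname, address from Department;
create view works_in_oid as                        -- the foreign key path is replaced by OIDs
  select ssn as eoid, depname as doid from works_in;</p>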
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. A complete example</title>
      <p>Suppose we have a source database schema as described at the top of Figure 4. The schema is
composed of a single table Source and a set of constraints. As humans, we can tell that the
schema is about people, identified by their social security number, having one or more phone
numbers, and possibly working in some department, identified by its name, having an address,
which also uniquely identifies the department at that address.</p>
      <p>[Figure 4: at the top, the source schema Source(ssn STRING, phone STRING, name STRING,
depname STRING, address STRING); at the bottom, its core conceptual schema in relational
form: Person-ssn(poid OID, ssn STRING), Person(poid OID), Employee(poid OID),
has-name(poid OID, name STRING), works-in(poid OID, doid OID), has-phone(poid OID,
phone STRING), Department-name(doid OID, depname STRING), Department(doid OID),
has-address(doid OID, address STRING).]</p>
      <p>Clearly, a lot of the information we just described about this database is hidden in the schema,
and we may wonder which could be its explicit conceptual schema to have a more direct
understanding of the data. By jumping a little ahead, let’s have a glance at the conceptual
schema, expressed in the ORM notation, in Figure 5. That schema describes precisely,
unambiguously, and formally the data as we were saying above. But how can we get this conceptual
schema from the original source schema? First we notice that this conceptual schema has a
direct representation as a logical schema in the relational setting: this is described at the bottom
of Figure 4. The relational version of the conceptual schema denotes exactly the same legal
databases as the ORM schema of Figure 5, but now in a pure relational setting. Note that the
constraints of this schema are only key constraints and unary inclusion dependencies among
OID datatypes – this property holds for all core conceptual schemas derivable from arbitrary
source schemas.</p>
      <p>The core conceptual schema is obtained from the source database schema by the application
of a sequence of lossless transformation patterns, of the type we have briefly introduced in the
previous Section. Since any lossless transformation step is accompanied by the mappings in the
two directions, we get from the transformation process also the views from the source data to
the conceptual data and the views from the conceptual data to the source data. In our example,
these mappings are shown in Figure 6. We have devised a methodology driving the design of
the correct sequence of lossless transformation patterns leading to a core conceptual schema
from a source schema.</p>
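      <p>Figure 6 is not reproduced here, but its flavour can be conveyed by a hedged SQL sketch of the source-to-conceptual direction over the Source table of Figure 4, once more implementing OIDs as surrogates of the natural keys (our illustrative choice, not the only option offered by the methodology):</p>
      <p>create view "Person-ssn" as select distinct ssn as poid, ssn from Source;
create view "Person" as select poid from "Person-ssn";
create view "has-name" as select distinct ssn as poid, name from Source;
create view "has-phone" as select distinct ssn as poid, phone from Source;
create view "Employee" as                      -- people possibly working in some department
  select distinct ssn as poid from Source where depname is not null;
create view "works-in" as
  select distinct ssn as poid, depname as doid from Source where depname is not null;
create view "Department-name" as
  select distinct depname as doid, depname from Source where depname is not null;
create view "Department" as select doid from "Department-name";
create view "has-address" as
  select distinct depname as doid, address from Source where depname is not null;</p>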
      <p>So now we have all the ingredients (the two sets of constraints and the two mappings) to
finalise the implementation of a Semantic SQL Transducer for the source schema as explained
in Section 4. Many things can be done once we have the transducer in place. We can understand
what the source data is about. We can query the source database with SQL queries using only
the core conceptual schema table names, bridging therefore the gap between business models
and available data sources. Along these lines, business users can directly update the source
data using just the conceptual vocabulary. In the other direction, legacy queries over the source
database can be explained by looking at their expansion in terms of the conceptual schema. The
Semantic SQL Transducer provides a well-founded semantic layer for data catalogues. Source
data structured according to its core conceptual schema makes data analysis much more effective,
since the correlations, the classifications, and the similarities of the data elements are much
more meaningful when done in business terms. Indeed, we observe that the conceptual schema
(Figure 5) is materialised in the relational database (lower part of Figure 4) using exactly the
vocabulary appearing in the ER diagram (Figure 5), and that both the source and the conceptual
databases (Figure 4) are connected via two sets of mappings (Figure 6), which implement the
lossless translations of the queries and of the updates at both ends.</p>
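      <p>For instance, a business user could retrieve the names of the people working at a given address using the conceptual vocabulary only (an illustrative query, with an invented address constant; the transducer rewrites it over the Source table):</p>
      <p>select n.name
from "works-in" w
join "has-name" n on n.poid = w.poid
join "has-address" a on a.doid = w.doid
where a.address = 'Piazza Domenicani 3';</p>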
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We have introduced in this paper a tool to losslessly restructure relational data, allowing
for seamless views of the data, which can be queried and updated at both ends maintaining
consistency and integrity. A special kind of transformation is when a database is restructured
according to its conceptual schema, providing therefore a materialised copy of the data, always
in sync with it, using the actual vocabulary understood by the business. The goal of this tool is
to support Knowledge-based Artificial Intelligence for Data Science, enabling advanced and
interpretable data analytics: by integrating semantic layers into data analysis pipelines, data
scientists can develop more robust models and interpret predictions more effectively. This
ensures data quality, accessibility, integrity, meaning, and interoperability.</p>
      <p>We have briefly experimented with this framework on the data collected from construction
process monitoring within the Confucious project of the Free University of Bozen-Bolzano. We
are planning a more serious evaluation in the context of a data integration project in a large
enterprise.</p>
      <p>This paper is a revised and extended version of [34]. This long-standing work has been
realised through collaborations and discussions with Nicola Pedot, Nonyelum Ndefo, Francesco
Sportelli, Sergio Tessaris, Volha Kerhet, Nhung Ngo, Paolo Guagliardo, David Toman, Grant
Weddell, Alex Borgida, Terry Halpin, Jan Hidders, Sebastian Link.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciaccia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martinenghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Conceptual constraints for data quality in data lakes</article-title>
          ,
          <source>in: Proceedings of the 1st Italian Conference on Big Data and Data Science</source>
          (itaDATA
          <year>2022</year>
          ), Milan, Italy,
          <source>September 20-21</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Data preparation: A survey of commercial tools</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>49</volume>
          (
          <year>2020</year>
          )
          <fpage>18</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Console</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          , Editorial:
          <article-title>Special issue on quality aspects of data preparation</article-title>
          ,
          <source>ACM J. Data Inf. Qual</source>
          .
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <volume>40</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          :
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          ,
          <article-title>Leveraging knowledge graphs for big data integration: the XI pipeline</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>13</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dibowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Svetashova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Using semantic technologies to manage a data lake: Data catalog, provenance and access control</article-title>
          ,
          <source>in: 12th International Workshop on Scalable Semantic Web Knowledge Base Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <article-title>Ontology-based data access and integration</article-title>
          , in: L. Liu, M. T. Özsu (Eds.),
          <source>Encyclopedia of Database Systems, Second Edition</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Housley</surname>
          </string-name>
          ,
          <source>Fundamentals of Data Engineering</source>
          , O'Reilly Media,
          <year>2022</year>
          . URL: https://books.google.it/books?id=3qd2EAAAQBAJ.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Franconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <article-title>A data warehouse conceptual data model for multidimensional aggregation</article-title>
          ,
          <source>in: Intl. Workshop on Design and Management of Data Warehouses, DMDW'99</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lubyte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tessaris</surname>
          </string-name>
          ,
          <article-title>Automatic extraction of ontologies wrapping relational data sources</article-title>
          ,
          <source>in: International Conference on Database and Expert Systems Applications</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , E. Franconi,
          <article-title>First-order ontology mediated database querying via query reformulation</article-title>
          ,
          <source>in: A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years</source>
          , volume
          <volume>31</volume>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Abgrall</surname>
          </string-name>
          ,
          <article-title>Formalization of data integration transformations</article-title>
          ,
          <source>in: New Trends in Database and Information Systems - ADBIS</source>
          <year>2022</year>
          , volume
          <volume>1652</volume>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>615</fpage>
          -
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Abgrall</surname>
          </string-name>
          ,
          <article-title>Schema decomposition via transformation patterns</article-title>
          ,
          <source>in: 32nd Symposium of Advanced Database Systems (SEBD-24)</source>
          , volume
          <volume>3741</volume>
          , CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>551</fpage>
          -
          <lpage>564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ndefo</surname>
          </string-name>
          , E. Franconi,
          <article-title>On preserving information in schema transformations: A constructive perspective</article-title>
          ,
          <source>in: 2nd IEEE International Conference on Artificial Intelligence and Knowledge Engineering</source>
          , AIKE
          <year>2019</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ndefo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Franconi</surname>
          </string-name>
          ,
          <article-title>A study on information-preserving schema transformations</article-title>
          ,
          <source>International Journal of Semantic Computing</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>27</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>Losslessness and semantic correctness of database schema transformation: another look of schema equivalence</article-title>
          ,
          <source>Information Systems</source>
          <volume>11</volume>
          (
          <year>1986</year>
          )
          <fpage>41</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] R. Hull, Relative information capacity of simple relational database schemata, SIAM Journal on Computing 15 (1986) 856–886.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] R. J. Miller, Y. E. Ioannidis, R. Ramakrishnan, The use of information capacity in schema integration and translation, in: 19th International Conference on Very Large Data Bases (VLDB-1993), 1993, pp. 120–133.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] X. Qian, Correct schema transformations, in: EDBT'96, 5th International Conference on Extending Database Technology, Springer, 1996, pp. 114–128.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Poulovassilis, P. McBrien, A general formal framework for schema transformation, Data &amp; Knowledge Engineering 28 (1998) 47–71.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] P. McBrien, A. Poulovassilis, Data integration by bi-directional schema transformation rules, in: Proceedings 19th International Conference on Data Engineering (ICDE-03), 2003, pp. 227–238.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] N. A. Mian, S. A. Khan, N. A. Zafar, Database reverse engineering methods: What is missing?, Res. J. Recent Sci. 2 (2013) 49–58.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] N. Lammari, I. Comyn-Wattiau, J. Akoka, Extracting generalization hierarchies from relational databases: A reverse engineering approach, Data &amp; Knowledge Engineering 63 (2007) 568–589.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] I. Astrova, Reverse engineering of relational databases to ontologies, in: European Semantic Web Symposium, Springer, 2004, pp. 327–341.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] C. Soutou, Relational database reverse engineering: algorithms to extract cardinality constraints, Data &amp; Knowledge Engineering 28 (1998) 161–207.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] M. Andersson, Extracting an entity relationship schema from a relational database through reverse engineering, in: International Conference on Conceptual Modeling, Springer, 1994, pp. 403–419.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] R. H. Chiang, T. M. Barron, V. C. Storey, Reverse engineering of relational databases: Extraction of an EER model from a relational database, Data &amp; Knowledge Engineering 12 (1994) 107–142.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] J.-L. Hainaut, Introduction to database reverse engineering, LIBD Lecture Notes (2002).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] J.-L. Hainaut, M. Chandelon, C. Tonneau, M. Joris, Contribution to a theory of database reverse engineering, in: Proceedings Working Conference on Reverse Engineering, IEEE, 1993, pp. 161–170.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] J.-L. Hainaut, C. Tonneau, M. Joris, M. Chandelon, Schema Transformation Techniques for Database Reverse Engineering, Springer Verlag, 1993, pp. 353–372.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] W. Ma, C. M. Keet, W. Oldford, D. Toman, G. E. Weddell, The utility of the abstract relational model and attribute paths in SQL, in: Knowledge Engineering and Knowledge Management - 21st International Conference, EKAW, Springer, 2018, pp. 195–211.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] E. Franconi, S. Tessaris, Relational algebra and calculus with SQL null values, CoRR abs/2202.10898 (2022). arXiv:2202.10898.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] E. Franconi, S. Tessaris, On the logic of SQL nulls, in: 6th Alberto Mendelzon International Workshop on Foundations of Data Management, CEUR Workshop Proceedings, 2012, pp. 114–128.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] A. Borgida, D. Toman, G. E. Weddell, On referring expressions in information systems derived from conceptual modelling, in: Conceptual Modeling - 35th International Conference, ER 2016, 2016, pp. 183–197.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] T. Abgrall, E. Franconi, Understanding the semantic SQL transducer, in: J. P. A. Almeida, C. Di Ciccio, C. Kalloniatis (Eds.), Advanced Information Systems Engineering Workshops, Springer Nature Switzerland, Cham, 2024, pp. 135–146.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] J. Albert, Y. E. Ioannidis, R. Ramakrishnan, Equivalence of keyed relational schemas by conjunctive queries, J. Comput. Syst. Sci. 58 (1999) 512–534.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] A. Khatiwada, R. Shraga, W. Gatterbauer, R. J. Miller, Integrating data lake tables, Proc. VLDB Endow. 16 (2022) 932–945.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] P. Papotti, R. Torlone, Schema exchange: Generic mappings for transforming data and metadata, Data &amp; Knowledge Engineering 68 (2009) 665–682.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>