                    Stratified Data Integration

 Fausto Giunchiglia[0000−0002−5903−6150] , Alessio Zamboni[0000−0002−4435−1748] ,
  Mayukh Bagchi[0000−0002−2946−5018] , and Simone Bocca[0000−0002−5951−4589]

      Department of Information Engineering and Computer Science (DISI),
                          University of Trento, Italy
{fausto.giunchiglia,alessio.zamboni,mayukh.bagchi,simone.bocca}@unitn.it



        Abstract. We propose a novel approach to the problem of semantic
        heterogeneity where data are organized into a set of stratified and inde-
        pendent representation layers, namely: conceptual (where a set of unique
        alinguistic identifiers are connected inside a graph codifying their mean-
        ing), language (where sets of synonyms, possibly from multiple languages,
        annotate concepts), knowledge (in the form of a graph where nodes are
        entity types and links are properties), and data (in the form of a graph
        of entities populating the previous knowledge graph). This allows us to
        state the problem of semantic heterogeneity as a problem of Representa-
        tion Diversity, where the different types of heterogeneity, viz. Conceptual,
        Language, Knowledge, and Data, are uniformly dealt with within each
        single layer, independently of the others. In this paper we describe the pro-
        posed stratified representation of data and the process by which data are
        first transformed into the target representation, then suitably integrated
        and, finally, presented to the user in her preferred format. The pro-
        posed framework has been evaluated in various pilot case studies and in
        a number of industrial data integration problems.

        Keywords: Semantic Heterogeneity · Knowledge Graph Construction ·
        Stratified Data Integration


1     Introduction

Semantic Heterogeneity [1,2], namely the existence of variance in the represen-
tation of the same real-world phenomenon, has long been a major impediment
to effective large scale data integration implementations. It is inescapable in
nature and is rooted in the diversity which is inherent in different means of rep-
resentation (which itself is rooted in world diversity) [3,4]. The pervasiveness of
semantic heterogeneity in its various forms has long been an overarching con-
cern in the data management research landscape [1,2]. In particular, its obtrusive
ramifications with reference to data integration scenarios have been widely ac-
knowledged, and partial approaches towards their resolution at the schema and
data level have been proposed, see for instance [1,5,6,7].
    Copyright © 2021 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
   In this paper we propose a novel approach which is based on the key as-
sumption that data should be represented as a set of stratified and independent
representation layers, namely:

 1. conceptual, where concepts, codified as a set of unique alinguistic identifiers,
    are connected inside a graph codifying their meaning;
 2. language, where sets of synonyms, i.e., synsets [8], possibly from multiple
    languages, annotate concepts;
 3. knowledge, in the form of a graph where nodes are entity types and links are
    properties; and
 4. data, in the form of a graph of entities populating the previous knowledge
    graph.
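
As a minimal illustration of this stratification, the four layers can be sketched as
data structures (a hedged sketch with hypothetical identifiers, not the system's
actual data model):

```python
# L1 - conceptual: unique alinguistic identifiers linked by meaning.
# (all ids below are hypothetical)
hypernym = {90002: 90001}  # concept 'car' (90002) is a kind of 'vehicle' (90001)

# L2 - language: synsets, possibly multilingual, annotate concepts.
synsets = {
    90001: {"en": ["vehicle"], "it": ["veicolo"]},
    90002: {"en": ["car", "auto"], "it": ["vettura", "automobile"]},
}

# L4 - knowledge: a graph whose nodes are entity types and links are properties.
etype_properties = {90002: ["plate", "speed", "modelDate"]}  # etype 'car'

# L5 - data: entities populating the knowledge graph with property values.
entities = [{"id": "#589625", "etype": 90002,
             "plate": "FP372MK", "speed": 150.0}]
```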

    The representation of data according to these four layers allows us to state
the problem of semantic heterogeneity as a problem of Representation Diversity,
where heterogeneity distributes itself over these layers thus being stratified into
four different types of diversity, viz. Conceptual, Language, Knowledge, and Data,
which can then be uniformly dealt with within a single layer, independently
of the others. The proposed approach has three major advantages. The first
is that the combinatorial explosion deriving from the interaction of the four
different types of diversity is avoided and the complexity of the data integration
problem reduces to the sum of the complexity of each layer. Each layer can be
dealt with as if all the other layers presented no heterogeneity at all. The second
is that the techniques developed for each layer can be composed with the ones
developed in the other layers irrespectively of how heterogeneity appears in the
current problem. The third, which is a direct consequence of the second, is that,
within each layer, it is possible to exploit the large body of work which can be
found in the literature (see the related work section for a detailed discussion on
this point).
    In this paper, after restating the semantic heterogeneity problem as a repre-
sentation diversity problem, we describe the proposed stratified representation
of data and the process by which data are first transformed into the target rep-
resentation, then suitably integrated and then, finally, presented to the user in
her preferred format. The main contributions of this paper can be articulated as
follows:

 1. a novel articulation of the problem of semantic heterogeneity as stratified
    into the above four layers of representation diversity (Section 2);
 2. a viable solution to the issues above in the form of an end-to-end logical data
    management architecture grounded in our four layered stratification of rep-
    resentation diversity, wherein we focus on accommodating the heterogeneity
    resident in each layer, independently of all the other layers (Section 3);
 3. an illustration of an implemented semi-automated data integration pipeline
    which exploits the representation of diversity presented in Section 3 (Section
    4).
The pipeline presented in Section 4 has been extensively evaluated within many
data integration pilot studies spanning three years, and has then been applied
in various real world problems. Section 5 presents a short highlight of these
activities. Towards the end of the paper, Section 6 contextualizes the contribu-
tion of our work by surveying the state of the art, while Section 7 outlines the
future research ventures.



Table 1: Three semantically heterogeneous schemas containing a heterogeneous
description of the same entity.

                                      Car
   Nameplate  schema:speed  schema:fuelCapacity  schema:fuelType  schema:modelDate
   FP372MK    150           62                   Petrol           2020-11-25

                                    Vettura
   Targa      Velocità      Tipo di corpo
   FP372MK    158           Coupé

                                    Vehicle
   vso:VIN    vso:feature   vso:modelDate  vso:speed
   FP372MK    Armrest       2020-11-25     155.0



2     The Stratification of Diversity

We show how semantic heterogeneity can be stratified into four independent di-
versity problems via the following motivating example (see also Table 1).
There are three datasets {Car, Vettura, Vehicle} which refer to the same real
world entity, namely a car, which we assume has plate ‘FP372MK’. The first
dataset describes FP372MK as a car, having five attributes, four of which are
expressed using the automotive extension of schema.org.1 The second dataset also
considers the entity as a car but its description is provided in Italian. The third
dataset encodes the entity as a vehicle having four attributes expressed employing
the Vehicle Sales Ontology2 namespace ‘vso:’.
Taking a close look at the above example, it is easy to notice four different types of
diversity, which we can briefly describe as follows:
1
    https://schema.org/docs/automotive.html
2
    http://www.heppnetz.de/ontologies/vso/ns
 – Conceptual diversity (L1):3 the same real world object is mentioned in two
   datasets using the concept denoted by the word car while in the third data
   set is called vehicle, namely using a more general term (because, for instance,
   in this latter case there is no need to distinguish among the various types of
   vehicles as the issue is that of counting the number of free parking lots).
 – Language diversity (L2): the same real world object is described using three dif-
   ferent lexicons, namely, that of a natural language, i.e., Italian, and two
   namespaces, i.e., from the automobile extension of schema.org and from the
   Vehicle Sales Ontology, both of which use a different natural language, i.e.,
    English, as base language. Notice that these are three different lexicons,
    independently developed, where, therefore, the meanings of the terms used
    are intuitively similar but formally unrelated.
 – Knowledge diversity (L4): the same real world object is described using dif-
   ferent properties, the motivation being most likely in the different focus of
   the three databases. Thus, for instance, the first could be the description
   used in an online car rental which codifies its data using schema.org, the
   second could be the description used by the Italian Automobile Club, while
   the third could be the description used in an online sales portal.
 – Data diversity (L5): the same real world object is described in a way that,
   even when associated with the same properties, the corresponding values are
   different. There can be many reasons for this last source of heterogeneity,
   for instance, different approximations, different formats, different units of
   measure, different reference standards (e.g., date standards for dates) and
   so on.
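
To make the example concrete, the three records of Table 1 can be roughly
rendered as follows (a sketch; the dataset names are ours), with each type of
diversity marked in the comments:

```python
# The same real-world car, as described by the three datasets of Table 1.
car_rental = {        # schema.org automotive extension
    "Nameplate": "FP372MK", "schema:speed": 150,
    "schema:fuelCapacity": 62, "schema:fuelType": "Petrol",
    "schema:modelDate": "2020-11-25",
}
automobile_club = {   # Italian natural language
    "Targa": "FP372MK", "Velocità": 158, "Tipo di corpo": "Coupé",
}
sales_portal = {      # Vehicle Sales Ontology
    "vso:VIN": "FP372MK", "vso:feature": "Armrest",
    "vso:modelDate": "2020-11-25", "vso:speed": 155.0,
}
# L1: 'Car'/'Vettura' vs the more general 'Vehicle'    (conceptual diversity)
# L2: English, Italian, and two namespaces             (language diversity)
# L4: three different sets of properties               (knowledge diversity)
# L5: speed = 150 vs 158 vs 155.0 for the same entity  (data diversity)
```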

Let us consider these four layers in detail.
Conceptual Diversity. The notion of concept is well known in the Philosophy
of Language literature, see, e.g., [10], and in Computational Linguistics, see, e.g.,
[8]. Here we follow our own work, as described in [11,12], and take concepts to be
unique alinguistic identifiers. Concepts are organized in multiple hierarchies, one
per syntactic type (i.e., noun, verb, adjective) wherein a child (father) concept
is taken to be more specialized (more general) than the father (child) [8,12]. For
instance noun hierarchies are organized in terms of ‘hypernym-hyponym’ links
where, in the example in Table 1, the term car is a direct hyponym of the term
vehicle.
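
A minimal sketch of such a hierarchy and of the subsumption test it supports
(hypothetical ids, not actual UKC identifiers):

```python
# Hypernym links: child concept id -> parent (more general) concept id.
hypernym = {90002: 90001}  # car -> vehicle (hypothetical ids)

def is_more_general(general: int, specific: int) -> bool:
    """Walk the hypernym chain from 'specific' upward."""
    while specific in hypernym:
        specific = hypernym[specific]
        if specific == general:
            return True
    return False

assert is_more_general(90001, 90002)  # vehicle subsumes car
```
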
Language Diversity. Languages are taken here in a very broad sense to include,
e.g., natural languages, namespaces and formal languages. Language diversity
occurs both across and within languages. Thus, there are multiple languages
available for the purpose of representing the same concept, but also, even within
the same language, linguistic phenomena like polysemy and synonymy allow for
multiple diverse representations of entities. As a result there is a many-to-many
3
    L1, L2, L3, L4, L5 are the five layers into which we organize the representation of
    data. L3, the layer used to represent causality [9], is not discussed here because it is
    irrelevant to the goals of the paper.
mapping between words and concepts, both within the same language and across
languages [11,12].
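
The many-to-many mapping can be sketched as follows (hypothetical ids; the
entry for ‘car’ illustrates polysemy, the one for ‘auto’ synonymy):

```python
# Many-to-many word <-> concept mapping (hypothetical concept ids).
word_to_concepts = {
    ("en", "car"): {90002, 90015},    # polysemy: automobile, railway car
    ("en", "auto"): {90002},          # synonymy: same concept, another word
    ("it", "vettura"): {90002},       # same concept across languages
    ("vso", "VIN"): {90031},          # namespace term treated as a word
}

def concepts_of(lang: str, word: str) -> set:
    return word_to_concepts.get((lang, word), set())

def words_of(concept: int):
    return [(l, w) for (l, w), cs in word_to_concepts.items() if concept in cs]
```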

Knowledge Diversity. We model knowledge as a set of entity types, also called
etypes, meaning by this classes of entities with associated properties. Knowledge
diversity arises from the many-to-many mapping between etypes and the proper-
ties employed to describe them [13], and can appear in one of two different forms.
The first appears when there are ‘n’ representations of different etypes described
in terms of the same set of properties. The second manifests itself when there
are ‘n’ representations of the same etype with different sets of properties. As an
example of the latter situation, in Table 1 two datasets describe the same etype
car, but the two etypes are associated with different groups of properties.
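
Both forms can be sketched directly on the etype-to-properties mapping (the
dataset and property names are ours):

```python
# Form 1: different etypes described by the same set of properties.
etype_props = {
    "car":   {"plate", "speed", "modelDate"},
    "truck": {"plate", "speed", "modelDate"},
}

# Form 2: the same etype described by different sets of properties
# in different datasets (as for 'car' in Table 1).
car_in_dataset = {
    "car_rental":      {"speed", "fuelCapacity", "fuelType", "modelDate"},
    "automobile_club": {"speed", "bodyType"},
}
```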

Data Diversity. We model data, meaning by this the concrete, ground knowledge
that we have about objects in the world, as entities each associated with property
values, where properties are inherited from the etype of the entity. Data diversity
[1] exists because the mapping between entities and the property values used
to describe them is many-to-many. Data diversity also appears in
two different forms, wherein the same real world entity, associated with the same
properties, is described using different data values, while dually, two different
real world entities, still associated with the same properties, can be described
using the same data values. As an example of the latter situation, there can be
two identical cars which are both described by a set of attributes which do not
contain their plate or any other identifying attribute. Table 1 also illustrates
the former situation: the three datasets refer to the same entity, a car with
plate ‘FP372MK’, and all three share a common attribute, the car speed, but
this property has three different values.
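
Both forms of data diversity can be sketched directly on the running example
(the dataset names are ours):

```python
# Form 1: same entity, same property, different values (Table 1: speed).
speeds = {"car_rental": 150, "automobile_club": 158, "sales_portal": 155.0}

# Form 2: different entities, same property values (two identical cars
# described without any identifying attribute such as the plate).
car_a = {"colour": "red", "modelDate": "2020-11-25"}
car_b = {"colour": "red", "modelDate": "2020-11-25"}
assert car_a == car_b  # indistinguishable at the data layer
```
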
    Notice how the stratification of diversity presented above has the following
crucial characteristics. The first is that each layer models a different type of
phenomenon and the corresponding type of diversity. The second is that how di-
versity appears in one layer is completely independent of how it appears in any
other layer. The third is that L1, the conceptual layer, provides the grounding
for unifying the various types of diversity as they appear in the other layers. In
fact, in all respects, L1 is a logical theory, which can be codified in a logical lan-
guage, e.g., Description Logics, where the semantics of all terms (the alinguistic
identifiers) is univocally defined by the links of the hierarchy. The fourth and
last observation is that the diversity mappings which appear in L2, L4 and L5
are all many-to-many and this generates the type of combinatorial complexity
which makes it so difficult to handle the problem of semantic heterogeneity. As
already hinted at in the introduction, the stratification of semantic heterogeneity
provides a major advantage in that it allows us to structure it into four independent
and much smaller problems, where each problem can be treated uniformly by
developing methods and techniques which are specialized just for that layer.
This last observation is the basis for the work presented in the remainder of this
paper.
               Fig. 1: The Representation Diversity Architecture


3   Representing Diversity

Fig. 1 depicts the proposed data management architecture instantiated (partially,
for lack of space) on the example in Table 1. Here the arrows represent the
functional dependencies which must be enforced among the different layers and,
therefore, implicitly define the order of execution which must be followed during
a data integration task, starting from the user input and concluding with the
fully integrated data. Fig.3 in Section 4 will later depict the process, tools and
algorithms which exploit the different representation layers in Fig.1 towards
producing, in the end, the target data integration.
    The language representation layer (L2) appears first and last in the architec-
ture in Fig.1. L2 enforces the input and the output dependence of the represen-
tation of data on the user language. In fact, language is the key enabler of the
bidirectional interaction between users and the platform. In the first phase, the
L2 input language is translated into the system internal L1 conceptual language
and the input language is only resumed during the last step, when the results
of the data integration steps are presented back to the user. To this end, no-
tice how, in Fig.1, the terms used in the LEG (L1), ETG (L4) and EG (L5)
are just ids, while the conversion table of the first phase in the repository
provides the mapping between the internal and the external language(s). In
this process, L2 is key
in keeping completely distinct the multilingual user-defined data representation
and the alinguistic system-level data representation. It is also important to no-
tice how the proposed data management architecture is natively multilingual
as a result of the L1 alinguistic concepts being the convergence of semantically
equivalent words in different languages. A very important case which can be
dealt with by this architecture is the heterogeneity of namespaces, as also reflected in
the running example. Any number of namespaces and natural languages can in
fact be seamlessly integrated following the same uniform process.
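
A hedged sketch of the role of the conversion table, with hypothetical ids and
terms, might look as follows:

```python
# Conversion table of the first phase: external terms -> internal concept ids.
to_internal = {("it", "vettura"): 90002, ("en", "car"): 90002,
               ("vso", "Vehicle"): 90001}

# Inverted on the way out, in the language the user selects.
def present(cid: int, lang: str) -> str:
    for (l, w), c in to_internal.items():
        if c == cid and l == lang:
            return w
    return f"#{cid}"  # fall back to the alinguistic id

assert present(90002, "it") == "vettura"
```
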
    The management of conceptual diversity (L1), which functionally comes next
in sequence, involves the organization of the L1 alinguistic concepts, as identi-
fied in the first step, into a Language Entity Graph (LEG) which codifies the
semantic relations across concepts (and, therefore, among the corresponding L2
input words). In order to achieve this goal we exploit, as a-priori knowledge, a
multilingual lexico-semantic resource, called Universal Knowledge Core (UKC)
[11,12] which represents words, synonyms, hyponyms and hypernyms quite sim-
ilarly to the Princeton WordNet [8], albeit with important differences [11]. The net
result of this phase is an LEG with the following properties:
 – the concepts identified during the first phase are all and only the nodes in
   this graph;
 – these nodes are annotated with the input L2 terms, across languages;
 – these nodes are organized into a hierarchy which preserves the ordering,
   across the links of the UKC (in the case of nouns, the synonym/ hyponym/
   hypernym relations).




                     Fig. 2: An example Entity Graph (EG)


    The third representation layer (L4), dedicated to the management of knowl-
edge diversity, involves the construction of an (alinguistic) Entity Type Graph
(ETG) encoded using only concepts occurring in the LEG constructed during
the previous two phases. In this phase, the first step is to distinguish concepts
into etypes and properties (both object properties and datatype properties) while
the second step is to organize them into a subsumption hierarchy. The key ob-
servation here, which constitutes a major departure from previous work, is
that etype subsumption, as encoded in the ETG, is enforced to be coherent with
the concept hierarchy encoded in the LEG. Thus, for instance, as in the above
example, the object with plate FP372MK, being a car, can be encoded to be
an entity of etype vehicle, but not of, e.g., etype organism. This fact, which
is natively enforced by the lexico-semantic hierarchy of the UKC for what con-
cerns natural languages (in the above example, Italian), is extended to cover also
the terms belonging to namespaces (in the above example that of schema.org
and that of the Vehicle Sales Ontology). This alignment of meanings across lan-
guages and namespaces, which absorbs a major source of heterogeneity present
in the (Semantic) Web, is a natural consequence of the language and conceptual
alignment performed during the first two phases.
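
A minimal sketch of this coherence check, reusing the hypernym walk sketched
in Section 2 (hypothetical ids):

```python
# Etype subsumption in the ETG must be coherent with the LEG hierarchy:
# 'car' may be declared a subtype of 'vehicle' (90002 -> 90001) but not
# of 'organism', because no hypernym path supports it.
hypernym = {90002: 90001}  # car -> vehicle (hypothetical ids)
ORGANISM = 90077

def coherent_subsumption(child: int, parent: int) -> bool:
    c = child
    while c in hypernym:
        c = hypernym[c]
        if c == parent:
            return True
    return False

assert coherent_subsumption(90002, 90001)         # car <= vehicle: accepted
assert not coherent_subsumption(90002, ORGANISM)  # car <= organism: rejected
```
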
    In the fourth representation layer (L5), we tackle the heterogeneity of data
values by employing an Entity Graph (EG), namely, a data-level knowledge graph
populating the ETG with the entities extracted from the input datasets. Fig.2
reports the EG resulting from the first four phases. As can be noticed, this graph
consists of a backbone of L1 alinguistic ids, each annotated with the input L2
terms, where, for each L2 term, the system remembers the dataset it comes from.
This information is crucial in case of iterated (multi-phased) data integration,
as is usually the case, since the system needs to remember which new terms and
values substitute which old terms and values. This mechanism is implemented via
a provenance mechanism, not represented in Fig.1, which applies to all the input
dataset elements, both at the schema and at the data level. A last observation is
that in Fig.1 the unique id identifying the car with plate FP372MK is #589625,
a new identifier which never appeared before. In fact, any time a new entity is
identified, it is associated with a unique id which is managed internally by the
system and which the user can see and also query, but not modify.
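
A hedged sketch of what an EG fragment for this entity might look like (the
JSON-LD-like shape, the concept ids and the provenance structure are our
assumptions):

```python
# A fragment of the EG for entity #589625, with per-value provenance
# (attribute names replaced by concept ids; all ids hypothetical).
entity = {
    "@id": "#589625",            # system-managed id: visible, queryable, read-only
    "etype": 90001,              # vehicle
    90031: {"value": "FP372MK",  # plate / VIN
            "provenance": ["car_rental", "automobile_club", "sales_portal"]},
    90042: {"value": 150.0,      # speed; conflicting inputs kept in provenance
            "provenance": ["car_rental"]},
}
```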




                 Fig. 3: The Representational Diversity Pipeline



4   Processing Diversity

The pipeline in Fig.3 describes the process used to manage and integrate the
diversity as it appears in the four layers described in the previous section. This
partially automated pipeline is highly flexible, independent of the input data
and domain, and largely customizable. As detailed in the next section, it has
been applied to the modeling of events, facilities, personal data, medical data,
university data [15] and many other domains. Its intended target users are Data
Scientists with no programming knowledge but with an understanding of the
domain and data integration problem to be solved. All tasks are managed via
a flexible, user-friendly interface and the process is assumed to start with
the input datasets being in a repository from where they can be uploaded. The
only required programming effort, which we assume should not be performed by
the data scientist, is that needed to extract the input datasets from, e.g., legacy
systems or open data repositories, and to pre-process them.
    The pipeline is composed of four main phases, as depicted in Fig.3, where
the first phase produces the integrated L1 and L2 representation of the LEG,
the second phase produces the ETG and the third phase produces the EG. The
fourth phase, the phase of EG Presentation, depicted in Fig.3 for completeness,
is representative of an external client application exploiting the LEG, ETG and
EG produced by the pipeline. During this phase, not further discussed, the data
scientist usually selects the language in which she wants the EG to be presented
to her, e.g., using the words of the input datasets or any other language supported
by the UKC.4
LEG Construction: This first activity takes as input all the data to be integrated (in
the example above, this consists of all the three tables represented in Table 1),
and it extracts each word and multi-word occurring in the input tables (in the
case of namespaces, the namespace itself and the word are treated as a single
concept). The output of this phase is the L1 and L2 representation of the input
data organized in a LEG. During this phase the prior knowledge codified in the
UKC is heavily exploited (see the previous section for details). This activity is
performed with the help of the Word Sense Disambiguation (WSD) component
SCROLL, a multilingual NLP library and pipeline which is specialised to handle
the Language of Data, as defined in [14], namely the type of NLP sentences that
are usually found in data. At the moment SCROLL supports seven languages (in-
cluding various European languages but also Mongolian and Chinese) but, as
we have found out, because of the similarity of many languages, of the relatively
simple structure of the language of data, and of the fact that the system pro-
cessing is under the control of the Data Scientist, who validates each step, SCROLL can
also be useful in various other similar languages (where similar here means not
diverse, with language diversity being defined as in [12]). The main features of
SCROLL which make it quite suitable for multilingual data integration are:

 – it has been developed to be highly modular and with a clear split between
   the modules which are language independent from those which are language
   dependent;
 – in SCROLL, the tasks that are language dependent and must therefore be
   implemented for each new language (e.g., word segmentation in English and
   Italian is very different from word segmentation in Chinese) are executed
4
    In the current state of implementation, this phase can only perform a word by word
    translation without being able to reconstruct the overall meaning of a sequence of
    words.
    as soon as possible. The net advantage is that the more semantics-depen-
    dent tasks, e.g., Entity Recognition (ER) and Word Sense Disambiguation
    (WSD), work on a conceptual representation of the data and are therefore
    implemented once and for all;
 – SCROLL’s NLP pipeline is highly optimized and fully integrated with the
   UKC, mainly with the goal of implementing a highly optimized and highly
   effective language agnostic WSD task which is also domain aware;
 – it is often the case that SCROLL encounters new words which are not in the
    UKC, which in turn may or may not contain the corresponding concept id.
    These situations are dealt with by suitably enriching the UKC according to
    a dedicated mechanism.

The output of this phase is a spreadsheet, including the structured definitions
of new concepts and their relations, which, suitably interpreted on the basis of
the UKC hierarchy, codifies the LEG.
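
A rough sketch of the word extraction and annotation step (the lexicon below
is a stand-in for the UKC; SCROLL's actual WSD is much richer, multilingual
and domain aware):

```python
import re

# Hypothetical lexicon standing in for the UKC (hypothetical concept ids).
lexicon = {"nameplate": 90031, "speed": 90042, "fuel type": 90051}

def annotate_header(header: str) -> tuple:
    """Map a (possibly namespaced) column name to a concept id, if known."""
    word = header.split(":")[-1]                          # drop the namespace
    word = re.sub(r"(?<!^)(?=[A-Z])", " ", word).lower()  # split camelCase
    return header, lexicon.get(word)                      # None -> enrich the UKC

print([annotate_header(h) for h in
       ["Nameplate", "schema:speed", "schema:fuelType"]])
# [('Nameplate', 90031), ('schema:speed', 90042), ('schema:fuelType', 90051)]
```
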
ETG Construction: This activity takes as input the schemas of the input datasets,
where all the words are now annotated with the LEG concepts and it constructs
the ETG which integrates them. This phase is performed via a Knowledge Edi-
tor, similar in spirit to Protégé,5 but highly integrated in the pipeline in Fig.3,
which is used interactively by the data scientist. Two main operations are
performed during this phase with the help of the knowledge editor:

 – perform schema matching. This activity is performed manually but it ben-
   efits from the suggestions provided by a multilingual schema matcher [17]
   (where, however, an effective way to integrate this matcher in the knowledge
   editor is yet to be found);
 – build the resulting integrated ETG and possibly align it with a reference
   ontology. The goal of this step is to produce a clean and highly reusable
   ETG. This step at the moment is performed manually, but the plan is to in-
   tegrate this functionality inside the Knowledge Editor, exploiting the results
   described in [13].

The output of this phase is an OWL file codifying the ETG where all the terms
which are used are annotated with the LEG’s concept ids.
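
As a sketch of such an output, the ETG can be serialized with rdflib,
annotating each class with its LEG concept id (the namespace and the
annotation property are hypothetical):

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS, OWL

EX = Namespace("http://example.org/etg#")      # hypothetical ETG namespace
g = Graph()
g.add((EX.Car, RDF.type, OWL.Class))
g.add((EX.Vehicle, RDF.type, OWL.Class))
g.add((EX.Car, RDFS.subClassOf, EX.Vehicle))   # coherent with the LEG
g.add((EX.Car, RDFS.label, Literal("car", lang="en")))
g.add((EX.Car, EX.conceptId, Literal(90002)))  # LEG concept-id annotation
print(g.serialize(format="turtle"))
```
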
EG Construction: This activity takes the ETG and the input datasets, anno-
tated with the LEG concepts, and constructs a mapping between the data values
within each dataset and the ETG built during the previous step. The EG Con-
struction is an iterative activity which considers one dataset at a time. The
mapping operations are implemented through the usage of a specific tool, called
KarmaLinker, which consists of the integration of the Karma data integration
tool [5,6], which does not do any NLP, with SCROLL. Within each activity it-
eration, KarmaLinker maps the data values in the input dataset to the etypes
and properties of the ETG. Some of the most important and non-trivial oper-
ations performed here are that of recognizing the entities which are implicitly
5
    https://protege.stanford.edu/
mentioned in the input datasets (entity detection), of recognizing their entity
types (etype recognition) and, finally, of recognizing whether there are multiple
occurrences of the same real world entities, possibly described using different
properties and property values (entity mapping, see the example of data het-
erogeneity presented in Section 2). In the example of Fig.1, all the three input
datasets are recognized to describe the same car which, as from Fig.2, is then
assigned the unique id #589625. The final output is an EG encoding the informa-
tion in the initial datasets, at all the four different levels of diversity, stored using
a language agnostic JSON-LD 6 format. See Fig.2 for a partial representation of
the EG constructed from the datasets in Table 1.
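
A toy sketch of the entity mapping step (the actual matching in KarmaLinker
is far more involved; here the plate is simply assumed to be the identifying key):

```python
from collections import defaultdict
from itertools import count

records = [
    {"dataset": "car_rental",      "plate": "FP372MK", "speed": 150},
    {"dataset": "automobile_club", "plate": "FP372MK", "speed": 158},
    {"dataset": "sales_portal",    "plate": "FP372MK", "speed": 155.0},
]

new_id = count(589625)                    # internal, system-managed ids
by_key = defaultdict(list)
for r in records:                         # entity mapping on an identifying key
    by_key[r["plate"]].append(r)

entities = {f"#{next(new_id)}": rs for rs in by_key.values()}
print(entities)   # one entity, #589625, merging all three descriptions
```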


5     Evaluation

The representation architecture and pipeline described above have been exper-
imented with and evaluated during the past three years (2018, 2019, 2020), the
last year still being ongoing, as part of the Knowledge and Data Integration (KDI)
class, a six credit course of the Master Degree in Computer Science of the Uni-
versity of Trento.7 Table 2 reports the information regarding the population in-
volved (excluding PhD students, who are not counted) and the number of projects.
During this class students, 2-5 people per group, must generate an EG using the
pipeline above starting from a high level problem specification. Part of the task
is also to identify the most suitable datasets and pre-process them. Datasets are
usually found in open data repositories but some of them are also scraped from
the Web. The overall project has an elapsed time of 14 weeks during which stu-
dents have to work intensely, even if not full time. We estimate the overall effort
that each group puts into building an EG at around 4-8 man-months, depending
on the case. At the end of the course, after the final exam, students are asked to
evaluate the methodology they have used (as partially described in this paper).
This evaluation involves various aspects including application scenarios, datasets
used, ETG and EG generation, language management and LEG generation, and
evaluation of the overall pipeline. As of today we have piloted 24 projects and
collected 75 evaluations.
A first question in the evaluation is about L2 and the management of language
diversity, with a specific emphasis on the use of the UKC. The percentage of
students who believe that the explicit management of language diversity, as a
dedicated independent phase, is worthwhile was 79.4% in 2018 and 80% in 2019.
A second question is about L4 and the management of knowledge diversity. More
specifically, here the purpose of the question is to understand if students find
it helpful to define the etypes of the input data, and if the pipeline properly
supports this task. 69% of the students in 2018, and 95% in 2019, provided
6
    https://json-ld.org/
7
    See https://unitn-kdi-2020.github.io/unitn-kdi-2020/ for more details. This site con-
    tains the material used during the 2020 edition of the course and it consists of theo-
    retical and practical lectures, as well as demos of the tools to be used, some of which
    have been mentioned above.
        Table 2: Evaluation subjects in the KDI class - 2018, 2019, 2020.
                                     2018    2019   2020    Tot
                 # students            29     20      26     75
                 # project teams       14      4      6      24
                 % Male               69%    75%     95%     59
                 % Female             31%    25%     5%      16



a positive answer. Moreover, 72.4% (2018) and 60% (2019) stated that
grounding the types with the reference ontology, usually an upper ontology,
facilitates the construction of the ETG and in particular the positioning of the
entity types. A last set of questions was asked about the overall usability of the
methodology, where each question aims at the evaluation of a specific usability
aspect. The answers are reported in Fig.4 for both 2018 and 2019, over a 1 to 7
scale (1 maximally positive, 7 maximally negative). Observing the figures below
we can notice an overall positive trend. Furthermore we can see that, with respect
to 2018, in 2019 students found the methodology easier to use and also easier
to learn. Moreover, we noticed that the level of efficiency in the accomplishment
of the data integration objectives increased in 2019. This is part of an overall
positive trend that, we believe, will also be confirmed in 2020, and that relates
to the continuous adjustments to the methodology and to the
tools that we perform every year, also based on the feedback collected during
the KDI course.




            Fig. 4: Usability of the methodology: (a) 2018, (b) 2019



6   Related Work
As stated in the introduction, our approach to representing and managing seman-
tic heterogeneity as the stratification of four independent problems has never been
proposed before. However, the very fact that we are stratifying semantic
heterogeneity into the four problems of conceptual diversity, language diversity,
knowledge diversity and data diversity, one at a time, allows us to refer to and
exploit the huge amount of work which has been independently done in these
areas. The remainder of this section concentrates on this work, across the four
layers, including also earlier work from the authors.
    The notion of language diversity, as described here, was first introduced in
[11,12] which also provide a detailed description of the UKC. However, this
work builds upon decades of work on the development of multilingual lexical
resources, see, e.g., [8,18]. The main innovation with respect to this earlier work
is that LEGs, and the UKC in particular, have a separated and independent
conceptual layer while, in the previous lexico-semantic resources, L1 and L2 are
collapsed. The stratification between L1 and L2 on one side and L4 on the other
side, and the exploitation of lexical resources in order to do schema integration is
a direct application of the ideas and technology developed in the field of ontology
matching, see, e.g., [19,20]. The work in [17], used during the ETG construction,
builds upon this previous work by proposing NuSM - a multilingual ontology
matching framework which heavily exploits the UKC.
    Our proposal of using knowledge graphs is very much in sync with the huge
amount of work now being developed in this area [21,22]. Differently from
previous work, we uniquely stratify Knowledge Graphs into four layers; see,
however, [23,24,25] for approaches which are quite similar in spirit to ours, in
particular in the distinction between L4 and L5. Furthermore, the validity of
an ontology guided, knowledge graph backed approach towards data integration
and presentation has been favourably discussed in [26,27].
    In the context of semantic data integration [28], the survey in [29] identified
non-standardized identity management, multilinguality management, and data
and schema heterogeneity as the prevailing difficulties, namely all issues which our
work addresses. The work in [7] also mentions architectural, structural, syntactic and
semantic heterogeneity in data integration frameworks, all issues that our pro-
posed approach tackles. The work in [30] and [31] combined, highlights diverse
parametric aspects of six major openly available knowledge graphs, with [30] call-
ing for newer approaches in knowledge modeling and new forms of knowledge
graphs. More specifically, Wikidata [32] emerges as a feature-rich, cross-domain
openly available knowledge graph [30]. Still, due to its very nature, and with
respect to the work proposed here, Wikidata lacks an adaptive schema cus-
tomizable to different data integration scenarios, and an overall explicit, strati-
fied data management architecture.


7   Conclusion

In this paper we have presented an innovative organization of data management
stratified across four layers of heterogeneity, namely concept, language, knowl-
edge and data. This has allowed the re-interpretation of semantic heterogeneity
as a problem of representation diversity and the proposal of a stratified logical
architecture which deals with this problem. Future work will consist of a gen-
eralization of the pipeline presented in this paper into a full-fledged knowledge
graph based methodology for data integration.


Acknowledgements
The research conducted by Fausto Giunchiglia, Mayukh Bagchi and Simone
Bocca has received funding from the “DELPhi - DiscovEring Life Patterns”
project funded by the MIUR Progetti di Ricerca di Rilevante Interesse Nazionale
(PRIN) 2017 – DD n. 1062 del 31.05.2019. The research conducted by Alessio
Zamboni was supported by the InteropEHRate project, co-funded by the Euro-
pean Union (EU) Horizon 2020 programme under grant number 826106.


References
 1. Halevy, A.: Why your data won’t mix: Semantic Heterogeneity. Queue 3(8), 50–58
    (2005)
 2. Hull, R.: Managing semantic heterogeneity in databases: a theoretical perspective.
    In: Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on
    Principles of database systems, pp. 51–61. (1997)
 3. Bouquet, P., Giunchiglia, F.: Reasoning about theory adequacy. A new solution
    to the qualification problem. Fundamenta Informaticae 23(2-4), 247–262 (1995)
 4. Giunchiglia, F., Maltese, V., Dutta, B.: Domains and context: first steps towards
    managing diversity in knowledge. Journal of Web Semantics, Special Issue on
    Reasoning with Context in the Semantic Web (2012)
 5. Knoblock, C.A., Szekely, P., Ambite, J.L., Goel, A., Gupta, S., Lerman, K., Muslea,
    M., Taheriyan, M., Mallick, P.: Semi-automatically mapping structured sources
    into the semantic web. In: Extended Semantic Web Conference, pp. 375–390.
    Springer, Heidelberg (2012)
 6. Knoblock, C.A., Szekely, P.: Exploiting semantics for big data integration. AI Mag-
    azine 36(1), 25–38 (2015)
 7. Leida, M., Gusmini, A., Davies, J.: Semantics-aware data integration for hetero-
    geneous data sources. Journal of Ambient Intelligence and Humanized Computing
    4(4), 471–491 (2013)
 8. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to
    WordNet: An on-line lexical database. International journal of lexicography 3(4),
    235–244 (1990)
 9. Giunchiglia, F., Fumagalli, M.: Teleologies: Objects, actions and functions. In: In-
    ternational Conference on Conceptual Modeling, pp. 520–534 (ER 2017). Springer,
    Cham (2017)
10. Millikan, R.: Language, thought, and other biological categories: New foundations
    for realism. MIT press. (1984)
11. Giunchiglia, F., Batsuren, K., Freihat, A.A.: One world–seven thousand languages.
    In: Proceedings 19th International Conference on Computational Linguistics and
    Intelligent Text Processing, CiCling2018, (2018)
12. Giunchiglia, F., Batsuren, K., Bella, G.: Understanding and exploiting language
    diversity. In: Proceedings of the 26th International Joint Conference on Artificial
    Intelligence, pp. 4009–4017. (2017)
13. Giunchiglia, F., Fumagalli, M.: Entity Type Recognition - dealing with the Di-
    versity of Knowledge. In: Seventeenth International Conference on Principles of
    Knowledge Representation and Reasoning. (2020)
14. Bella, G., Gremes, L., Giunchiglia, F.: Exploring the Language of Data. In: Pro-
    ceedings of the 28th International Conference on Computational Linguistics, pp.
    6638-6648. (2020).
15. Maltese, V., Giunchiglia, F.: Foundations of Digital Universities. Cataloging &
    Classification Quarterly 55(1), 26–50 (2017)
16. Bella, G., Zamboni, A., Giunchiglia, F.: Domain-Based Sense Disambiguation in
    Multilingual Structured Data. DIVERSITY workshop, ECAI, 53 (2016)
17. Bella, G., Giunchiglia, F., McNeill, F.: Language and domain aware lightweight
    ontology matching. Journal of Web Semantics 43, 1–17 (2017)
18. Navigli, R., Ponzetto, S.P.: BabelNet: Building a very large multilingual semantic
    network. In: Proceedings of the 48th annual meeting of the association for compu-
    tational linguistics, pp. 216–225. (2010)
19. Euzenat, J., Shvaiko, P.: Ontology Matching. Vol. 18. Springer, Heidelberg (2007)
20. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab
    2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Sys-
    tems. In: European Semantic Web Conference, pp. 514–530. Springer, Cham (2020)
21. Iglesias, E., Jozashoori, S., Chaves-Fraga, D., Collarana, D. and Vidal, M.E.: SDM-
    RDFizer: An RML interpreter for the efficient creation of rdf knowledge graphs. In:
    Proceedings of the 29th ACM International Conference on Information Knowledge
    Management, pp. 3039-3046. (2020)
22. Jozashoori, S., Chaves-Fraga, D., Iglesias, E., Vidal, M.E. and Corcho, O.: FunMap:
    Efficient Execution of Functional Mappings for Knowledge Graph Creation. In:
    International Semantic Web Conference, pp. 276-293. Springer, Cham (2020)
23. Fensel, D., Şimşek, U., Angele, K., Huaman, E., Kärle, E., Panasiuk, O., Toma,
    I., Umbrich, J., Wahler, A.: Knowledge Graphs. 1st edn. Springer International
    Publishing, Switzerland (2020)
24. Kejriwal, M.: Domain-Specific Knowledge Graph Construction. 1st edn. Springer,
    Heidelberg (2019)
25. Bonatti, P.A., Decker, S., Polleres, A., Presutti, V.: Knowledge graphs: New direc-
    tions for knowledge representation on the semantic web (dagstuhl seminar 18371).
    Dagstuhl Reports 8(9) (2019)
26. Gagnon, M.: Ontology-based integration of data sources. In: 10th International
    Conference on Information Fusion, pp. 1–8. IEEE. (2007)
27. Gawriljuk, G., Harth, A., Knoblock, C.A. and Szekely, P.: A scalable approach to
    incrementally building knowledge graphs. In: International Conference on Theory
    and Practice of Digital Libraries , pp. 189–199. Springer, Cham (2016)
28. Cheatham, M., Pesquita, C.: Semantic data integration. Handbook of big data
    technologies (Springer). 263–305 (2017)
29. Mountantonakis, M., Tzitzikas, Y.: Large-scale semantic integration of linked data:
    A survey. ACM Computing Surveys (CSUR) 52(5), 1–40 (2019)
30. Färber, M., Ell, B., Menne, C., Rettinger, A.: A comparative survey of dbpedia,
    freebase, opencyc, wikidata, and yago. Semantic Web Journal 1(1), 1–5 (2015)
31. Ringler, D. and Paulheim, H.: One knowledge graph to rule them all? Analyz-
    ing the differences between DBpedia, YAGO, Wikidata & co. In: Joint Ger-
    man/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz), pp.
    366–372. Springer, Cham (2017)
32. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledge base. Com-
    munications of the ACM 57(10), 78–85 (2014)