<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GAIA: a Framework for Schema Mapping Reuse (DISCUSSION PAPER)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Atzeni</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Bellomarini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Papotti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Universita Roma Tre</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Italy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Banca d'Italia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Italy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>EURECOM</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>6</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Schema mappings are widely used as a tool for data exchange and integration. However, although there are systems supporting data architects in the creation of mappings [5], designing them is still a time-consuming task. In this framework, given the overwhelming amount of \enterprise knowledge" stored in traditional data warehouses and in data lakes, reuse is an opportunity of increasing importance [1]. In particular, data transformation scenarios are often de ned over schemas that are di erent in structure but similar in semantics. This is especially true if data sources, which are extremely heterogeneous, have to be mapped to a shared format. It follows that a great opportunity to reduce the e ort of transformation design is to reuse existing schema mappings. Unfortunately, there is no obvious approach for this problem. Consider the following example. Example 1. A central bank maintains a register with balance data from all companies in the country (Figure 1). This register has schema G, with a relation Balance storing, for each company, its gains, zone of operation, and economic sector. External providers send data to the bank in di erent forms. Provider A adopts a schema SA, with a relation RA for companies ( rms), with gains, zone of operation, and economic sector, whose code refers to relation Activity. Provider B adopts a schema SB, with a relation RB for companies (enterprises), their gains, sector, capital, and area, whose code refers to relation Location. Data is moved from SA and SB into G, by using two schema mappings: A: RA(f; g; z; s); Activity(s; d) ! Balance(f; g; z; d): B: RB(e; g; s; c; a); Location(a; n) ! Balance(e; g; n; s): The example shows a data exchange scenario where the di erences in the mappings are due to the structural di erences between SA and SB, which are, on the other hand, semantically very similar. Moreover, every new data provider (e.g., SC in the gure) would require the manual design of a new, ad-hoc mapping, even if there is a clear analogy with the already de ned mappings. Our goal is to reuse A and B and avoid the de nition of a new mapping for SC . The intuition is to collect all available mappings in a repository; then,</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Firm
ACME
CARS</p>
      <p>Gains
15
36</p>
      <p>Zone Sector
Area A B8.9.1
Area S N77.1.1</p>
      <p>Enterprise Gains Sector Capital Area
EDEN 2 Sport 5 V12</p>
      <p>FLOUR 6 Metals 3 M1
Code
B8.9.1
N77.1.1</p>
      <p>Activity</p>
      <p>Description
Chemical
Renting</p>
      <p>FKey</p>
      <p>FKey
Code
M1
V12</p>
      <p>Location</p>
      <p>Name
Area M
Area V
SC FKey</p>
      <p>RC
Enterprise Sector Capital Profits Area
HOTELS M57 12 1 S14
PEARS C18 2 7 G2</p>
      <p>FKey</p>
      <p>Activity
Code Description
C18 Media
M57 Housing</p>
      <p>Location
Code Name
G2 Area G
S14 Area S</p>
      <p>G</p>
      <p>Balance
Company Gains Zone Sector
ACME 15 Area A Chemical
CARS 36 Area S Renting
EDEN 2 Area V Sport
FLOUR 6 Area M Metals
HOTEL
PEARS
for any new pair of schemas (e.g., SC and G in the gure), query such
repository to retrieve a suitable mapping. Unfortunately, this form of direct reuse
is complicated by the nature of schema mappings. A mapping characterizes the
constraint between a pair of schemas at a level of detail that enables both logical
reasoning and e cient execution. Yet, a simple variation in a schema, such as a
di erent relation or attribute name or a di erent number of attributes, makes
it not applicable. Our experiments show that mappings from a corpus of 1.000
schema mappings can be reused for new, unmapped pairs of schemas only in
20% of cases. To be reusable, a mapping should be described in a way that is
independent of its speci cities but, at the same time, harnesses the essence of
the constraint so as to work for similar schemas.</p>
      <p>Example 2. Consider a \generic" mapping A, obtained from A by replacing
names of the relations and attributes with variables. It could be informally
described as follows:</p>
      <p>A: for each relation r with key f and attributes g, a, s
for each relation r0 with key s and attribute d</p>
      <p>with a foreign key constraint from s of r to s of r0
there exists a relation r00 with key f and attributes g, a, d.</p>
      <p>If instantiated on SA, the generic mapping A expresses a mapping to G that
is the same as A. This solution seems a valid compromise between precision,
i.e., the ability to express the semantics of the original mapping, and generality,
as it can be applicable over di erent schemas. However, A falls short of the
latter requirement, as it is not applicable on SB. Indeed, there are no constraints
on attribute g and a, and so they could be bound to any of Gains, Sector and
Capital, incorrectly trying to map Capital into the target.</p>
      <p>Example 3. Consider now a more elaborated generic mapping that uses
constants to identify attributes:</p>
      <p>BH : for each relation r with key e and attributes g, s, c, a
for each relation r0 with key a and attribute d
with a foreign key constraint from a of r to a of r0
where g = Gains; s 6= Gains; s 6= Capital; c 6= Gains; c 6= Sector
there exists a relation r00 with key e and attributes g, d, s.</p>
      <p>This generic mapping is precise enough to correctly describe both
and can be re-used with other schemas.</p>
      <p>A and</p>
      <p>B</p>
      <p>The example shows a combination of attributes, identi ed by constraints on
their names and role, that form a correct and useful generic mapping. Once
pinpointed, generic mappings can be stored in a repository, so that it is possible
to use them for a new scenario. In our example, given SC and G, the generic
mapping BH can be retrieved from the repository and immediately applied.</p>
      <p>There are three main challenges in this approach.
- We need a clear characterization of what it means for a generic mapping to
correctly describe and capture the semantics of an original schema mapping.
- As a generic mapping is characterized by a combination of conditions on
attribute names and roles, for a given schema mapping there is a combinatorial
number of generic mappings. We need a mechanism to generate them.
- For a new scenario (e.g., new schemas), there is an overwhelming number of
generic mappings that potentially apply, with di erent levels of \suitability".
We need e cient tools to search through and choose among them.</p>
      <p>In this work, we address the above challenges with GAIA, a system for mapping
reuse. GAIA supports two tasks, as shown in Figure 2: (1) infer generic
mappings, called meta-mappings, from input schema mappings, and store them in a
repository; (2) given a source and a target schema, return a ranked list of
metamappings from the repository which are used to generate possible mappings
between these schemas. GAIA provides the following key contributions:
{ The notion of tness: a semantics to precisely characterize and check when a
meta-mapping is suitable for a reuse scenario.
{ An algorithm to infer meta-mappings from schema mappings with an approach
that extends previous e orts for the de nition of schema mappings by example;
this algorithm is used to populate a repository of meta-mappings supporting
schema mapping reuse.
{ An approach to reuse based on: (i) the search, in the repository of available
meta-mappings, for those that t a new pair of source and target schemas and
(ii) the construction, from the retrieved meta-mappings, of possible mappings
to be proposed to the designer.</p>
      <p>
        Because of space limitation, algorithms and details are in the full version of
the paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the rest of this paper, we provide examples of meta-mappings
(Section 2) and experimental results from an evaluation of our system with more
than 20,000 real-world data transformations over 40,000 schemas (Section 3).
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Mapping and meta-mappings</title>
      <p>
        We recall the notion of schema mapping [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and introduce that of meta-mapping.
While the former notion models speci c transformations, the latter introduces an
abstraction over mappings [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and models generic mappings between schemas.
Building on these notions, we illustrate the functionalities of our system.
Schema mappings. Let S (the source) and T (the target) be two relational
schemas and let Inst(S) and Inst(T) denote the set of all possible instances of
S and T, respectively. A (schema) mapping M for S and T is a binary relation
over their instances, that is, M Inst(S) Inst(T) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Without loss of generality, we consider mappings expressed by a single
sourceto-target tuple-generating-dependency (st-tgd) : 8x( (x) ! 9y (x; y)) where
x and y are two disjoint sets of variables, (x) (the left-hand-side, LHS) is
a conjunction of atoms involving relations in S and (x; y) (the
right-handside, RHS) is a conjunction of atoms involving relations in T. The dependency
represents a mapping M in the sense that (I; J ) 2 M if and only if (I; J ) satis es
. In this case, J is called a solution of I under . We can compute a suitable J in
polynomial time by applying the chase procedure to I using [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: the result may
have labeled nulls denoting unknown values and is called the universal solution,
since it has a homomorphism to any possible solution J 0, that is, a mapping h
of the nulls into constants and nulls such that h(J ) J 0.
      </p>
      <p>Example 4. Consider the schemas SA, SB and G of Figure 1, which we recall in
Figure 3 with all the formalisms that will be discussed throughout the paper.</p>
      <p>A mapping between SA and G is the st-tgd A discussed in the Introduction
and reported in Figure 3 (quanti ers are omitted for the sake of readability).
schemas
mappings
m-schemas</p>
      <p>Att
name in
Gains RA</p>
      <p>Zone RA
Description Activity</p>
      <p>Att
name in
Gains RB
Sector RB
Capital RB</p>
      <p>Name Location
SA</p>
      <p>Rel
name</p>
      <p>RA
Activity</p>
      <p>Key
name in
Firm RA</p>
      <p>Code Activity
SB</p>
      <p>Rel
name</p>
      <p>RB
Location</p>
      <p>Key
name
Enterprise</p>
      <p>Code
in</p>
      <p>RB
Location</p>
      <p>FKey
name in refer
Sector RA Activity</p>
      <p>FKey
name in refer
Area RB Location
Intuitively, the application of the chase to the instance of SA using A enforces
this dependency by generating one tuple in the target for each pair of tuples in
the source for which there is a binding to the LHS of the dependency. The result
includes the rst two tuples in relation Balance in Figure 1. Besides, a mapping
from SB to G is represented by the s-t tgd B in Figure 3.</p>
      <p>
        Meta-mappings. A meta-mapping describes generic mappings between
relational schemas and is de ned as a mapping over the catalog of a relational
database [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Speci cally, in a relational meta-mapping, source and target are
both de ned over the following schema, called (relational) dictionary : Rel(name),
Att(name; in), Key(name; in), FKey(name; in; refer ) (for the sake of simplicity,
we consider here a simpli ed version of the relational model). An instance S of
the dictionary is called m-schema and describes relations, attributes and
constraints of a (standard) relational schema S. Figure 3 shows the m-schemas of
the schemas SA, SB, and G of the running example.
      </p>
      <p>We assume, hereinafter, that, given a schema, its corresponding m-schema is
also given, and vice versa. A meta-mapping is expressed by means of an st-tgd
over dictionaries that describes how the elements of a source m-schema map to
the elements of a target m-schema.</p>
      <p>Example 5. Mapping A of Example 4 can be expressed, at the dictionary level,
by meta-mapping A in Figure 3. This st-tgd describes a generic transformation
that takes two source relations R and S linked by a foreign key F and generates
a target relation T obtained by joining R and S on F that includes: the key K1
and the attributes A1 and A2 from relation R and the attribute A3 from S.
Given a source m-schema S and a meta-mapping M , a target m-schema T is
generated by applying the chase procedure to S using M .</p>
      <p>Example 6. The chase of SA using A, both in Figure 3, generates the following
target m-schema where ?R is a labelled null denoting a relation name.</p>
      <p>Rel
name
?R</p>
      <p>Key
name in
Firm ?R</p>
      <p>Att
name in
Gains ?R</p>
      <p>
        Zone ?R
Description ?R
This m-schema describes the relational schema: R(Firm; Gains; Zone; Description)
A meta-mapping operates at schema level rather than at data level and thus
provides a means for describing generic transformations. Subtleties could arise from
the chase procedure in presence of existential quanti cations in meta-mappings
producing duplications of relations in the result. This is avoided by assuming
that, for each existentially quanti ed variable, there is a target equality
generating dependency (egd) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] ensuring that whenever two relations in the target
have the same structure, then they coincide.
      </p>
      <p>
        From meta-mappings to mappings. Given a source schema S and a
metamapping , it is possible not only to generate a target schema by using the chase,
as shown in Example 6, but also to automatically obtain a schema mapping
that represents the specialization of for S and T [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The schema to data
exchange transformation (SD transformation) generates from S and a
complete schema mapping made of S, a target schema T (obtained by chasing the
m-schema of S with the meta-mapping), and an s-t tgd between S and T. The
correspondences between LHS and RHS of are derived from the provenance
information computed during the chase step, in the same fashion as the provenance
computed over the source instance when chasing schema mappings [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Example 7. Consider again the scenario in Figure 3. If we apply the SD
transformation to the schema SA and the meta-mapping A, we obtain the target
m-schema of Example 6 and the following mapping from SA to ?R:
: RA(f; g; z; s); Activity(s; d) ! ?R(f; g; z; d):
Thus, we get back, up to a renaming of the target relation, the mapping
Figure 3 from which A originates.
      </p>
      <p>
        A in
From mappings to meta-mappings While previous work focused on
generating a mapping from a given meta-mapping [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we tackle the more general
problem of mapping reuse, which consists of: (i) generating a repository of
metamappings from a set of user-de ned schema mappings, and (ii) given a new pair
of source and target schemas, generating a suitable mapping for them from the
repository of meta-mappings.
      </p>
      <p>Example 8. For the scenario in Figure 3, our system rst generates several tting
meta-mappings from A between SA and G. Once SB and G are given as input
for a new transformation, the system scans the corpus of existing meta-mappings
and identi es A as a tting meta-mapping from SB to G. This meta-mapping
is then instantiated to generate a schema mapping between them, according to
the SD transformation above.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental evaluation</title>
      <p>We implemented GAIA in PL/SQL 11.2 for Oracle 11g. All experiments were
conducted on Oracle Linux, with an Intel Core i7@2.60GHz, 16GB RAM.
Datasets and Transformations. We used data transformations that are
periodically executed to store data coming from several data sources into the Central
National Balance Sheet (CNBS) database, an archive of nancial information of
about 40K enterprises. The schema of CNBS has 5 relations with roughly 800
attributes in total. Source data come from three di erent providers. The Chamber
of Commerce provides data for 20K companies (datasets Chamber). While the
schemas of these datasets are similar, the di erences require one mapping per
company between relations of up to 30 attributes with an average of 32 (self)
joins in the LHS of the mappings. Data for 20K more companies is collected
by a commercial data provider in a single database (CDP). This database is
di erent both in structure and attributes names from the schemas in Chamber
and requires one mapping involving 15 joins in its LHS. Finally, data for
further 1,300 companies (Stock) is imported from the stock exchange, with each
company requiring a mapping with 40 joins on average.</p>
      <p>Transformation scenarios. For the inference of the meta-mappings, we
consider two con gurations: single, where a meta-mapping is inferred from one
mapping, and multiple, where the inference is computed on a group of mappings for
companies in the same business sector. We observe that the results stabilize with
10 schema mappings in the group and do not improve signi cantly with larger
numbers, therefore consider 10 mappings for the multiple con guration.
Search precision. Let be a mapping from a source S to a target T and let
Q be the set of meta-mappings inferred from . We measure, in terms of search
precision, the ability of the system to return the meta-mappings in Q when
queried with S and T, i.e., how well the system retrieves correct cases.</p>
      <p>We use a resubstitution approach: we populate an initially empty repository
by inferring the meta-mappings from a set of mappings. For each schema
mapping in , we will denote by Q all the meta-mappings inferred from
. We then pick a schema mapping 0 from and search the repository by
providing as input the source and target schemas of 0. We then collect the
top10 results according to the coverage and compute the percentage of correctly
retrieved meta-mapping, that is, those that belong to Q 0 .</p>
      <p>We test search precision with a corpus of mappings of increasing size (from
200 to 21,301). In each test, we query the repository of meta-mappings inferred
from the given mappings by using 50 di erent source-target pairs and report
their average search precision. The experiment is executed on single and
multiple con gurations. The largest repository contains about 700K explicit
metamappings in single con guration and 110K in multiple con guration. In the
multiple scenario, the test is considered successful on a mapping when the
retrieved meta-mapping originates from the group that includes . On average,
a meta-mapping includes 12 equality and/or inequality constraints.</p>
      <p>The plot in Figure 4 shows that
search precision decreases with the
size of the repository as larger
numbers of mappings lead to an
increasing number of false positives in the
result. We report the search precision
for the CDP mapping with a separate
line. When this mapping is inserted
in the repository (after 10K
transformations), it is immediately identi ed
Fig. 4. Search precision. with both con gurations. This shows
that speci c structures and constants
lead to meta-mappings that are easy to retrieve. Meta-mappings derived from
multiple schema mappings improve search precision because are more general
than the ones from single mappings, thus reducing the risk of over tting.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Abadi</surname>
          </string-name>
          et al.
          <source>The Beckman report on database research. Commun. ACM</source>
          ,
          <volume>59</volume>
          (
          <issue>2</issue>
          ):
          <volume>92</volume>
          {
          <fpage>99</fpage>
          ,
          <string-name>
            <surname>Jan</surname>
          </string-name>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Atzeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellomarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          .
          <article-title>Meta-mappings for schema mapping reuse</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>12</volume>
          (
          <issue>5</issue>
          ):
          <volume>557</volume>
          {
          <fpage>569</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          .
          <source>Model management 2</source>
          .
          <article-title>0: manipulating richer mappings</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Debugging schema mappings with routes</article-title>
          .
          <source>In VLDB</source>
          , pages
          <volume>79</volume>
          {
          <fpage>90</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          . Clio:
          <article-title>Schema mapping creation and data exchange</article-title>
          .
          <source>In Conceptual Modeling</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kolaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          .
          <article-title>Data exchange: Semantics and query answering</article-title>
          .
          <source>In ICDT</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Data exchange with data-metadata translations</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>260</volume>
          {
          <fpage>273</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          .
          <article-title>Schema exchange: Generic mappings for transforming data and metadata</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>68</volume>
          (
          <issue>7</issue>
          ):
          <volume>665</volume>
          {
          <fpage>682</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>