=Paper=
{{Paper
|id=Vol-2400/paper-10
|storemode=property
|title=GAIA: A Framework for Schema Mapping Reuse (extended abstract)
|pdfUrl=https://ceur-ws.org/Vol-2400/paper-10.pdf
|volume=Vol-2400
|authors=Paolo Atzeni,Luigi Bellomarini,Paolo Papotti,Riccardo Torlone
|dblpUrl=https://dblp.org/rec/conf/sebd/AtzeniBPT19
}}
==GAIA: A Framework for Schema Mapping Reuse (extended abstract)==
GAIA: a Framework for Schema Mapping Reuse (DISCUSSION PAPER) Paolo Atzeni1 , Luigi Bellomarini2 , Paolo Papotti3 , and Riccardo Torlone1 1 Università Roma Tre, Italy, 2 Banca d’Italia, Italy, 3 EURECOM, France 1 Introduction Schema mappings are widely used as a tool for data exchange and integration. However, although there are systems supporting data architects in the creation of mappings [5], designing them is still a time-consuming task. In this framework, given the overwhelming amount of “enterprise knowledge” stored in traditional data warehouses and in data lakes, reuse is an opportunity of increasing im- portance [1]. In particular, data transformation scenarios are often defined over schemas that are different in structure but similar in semantics. This is especially true if data sources, which are extremely heterogeneous, have to be mapped to a shared format. It follows that a great opportunity to reduce the effort of trans- formation design is to reuse existing schema mappings. Unfortunately, there is no obvious approach for this problem. Consider the following example. Example 1. A central bank maintains a register with balance data from all com- panies in the country (Figure 1). This register has schema G, with a relation Balance storing, for each company, its gains, zone of operation, and economic sector. External providers send data to the bank in different forms. Provider A adopts a schema SA , with a relation RA for companies (firms), with gains, zone of operation, and economic sector, whose code refers to relation Activity. Provider B adopts a schema SB , with a relation RB for companies (enterprises), their gains, sector, capital, and area, whose code refers to relation Location. Data is moved from SA and SB into G, by using two schema mappings: σA : RA (f, g, z, s), Activity(s, d) → Balance(f, g, z, d). σB : RB (e, g, s, c, a), Location(a, n) → Balance(e, g, n, s). The example shows a data exchange scenario where the differences in the mappings are due to the structural differences between SA and SB , which are, on the other hand, semantically very similar. Moreover, every new data provider (e.g., SC in the figure) would require the manual design of a new, ad-hoc map- ping, even if there is a clear analogy with the already defined mappings. Our goal is to reuse σA and σB and avoid the definition of a new mapping for SC . The intuition is to collect all available mappings in a repository; then, Copyright c 2019 for the individual papers by the papers authors. Copying permit- ted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. SA RA SB RB Firm Gains Zone Sector Enterprise Gains Sector Capital Area ACME 15 Area A B8.9.1 EDEN 2 Sport 5 V12 CARS 36 Area S N77.1.1 FLOUR 6 Metals 3 M1 Activity Location FKey FKey Code Description Code Name B8.9.1 Chemical M1 Area M N77.1.1 Renting V12 Area V SC FKey G Balance RC FKey Company Gains Zone Sector Enterprise Sector Capital Profits Area ACME 15 Area A Chemical HOTELS M57 12 1 S14 CARS 36 Area S Renting PEARS C18 2 7 G2 EDEN 2 Area V Sport Activity Location FLOUR 6 Area M Metals Code Description Code Name HOTEL 1 Area S Housing C18 Media G2 Area G PEARS 7 Area G Media M57 Housing S14 Area S Fig. 1. A data transformation scenario for any new pair of schemas (e.g., SC and G in the figure), query such repos- itory to retrieve a suitable mapping. Unfortunately, this form of direct reuse is complicated by the nature of schema mappings. A mapping characterizes the constraint between a pair of schemas at a level of detail that enables both logical reasoning and efficient execution. Yet, a simple variation in a schema, such as a different relation or attribute name or a different number of attributes, makes it not applicable. Our experiments show that mappings from a corpus of 1.000 schema mappings can be reused for new, unmapped pairs of schemas only in 20% of cases. To be reusable, a mapping should be described in a way that is independent of its specificities but, at the same time, harnesses the essence of the constraint so as to work for similar schemas. Example 2. Consider a “generic” mapping ΣA , obtained from σA by replacing names of the relations and attributes with variables. It could be informally de- scribed as follows: ΣA : for each relation r with key f and attributes g, a, s for each relation r0 with key s and attribute d with a foreign key constraint from s of r to s of r0 there exists a relation r00 with key f and attributes g, a, d. If instantiated on SA , the generic mapping ΣA expresses a mapping to G that is the same as σA . This solution seems a valid compromise between precision, i.e., the ability to express the semantics of the original mapping, and generality, as it can be applicable over different schemas. However, ΣA falls short of the latter requirement, as it is not applicable on SB . Indeed, there are no constraints on attribute g and a, and so they could be bound to any of Gains, Sector and Capital, incorrectly trying to map Capital into the target. Example 3. Consider now a more elaborated generic mapping that uses con- stants to identify attributes: H ΣB : for each relation r with key e and attributes g, s, c, a for each relation r0 with key a and attribute d with a foreign key constraint from a of r to a of r0 where g = Gains, s 6= Gains, s 6= Capital, c 6= Gains, c 6= Sector there exists a relation r00 with key e and attributes g, d, s. This generic mapping is precise enough to correctly describe both σA and σB and can be re-used with other schemas. The example shows a combination of attributes, identified by constraints on their names and role, that form a correct and useful generic mapping. Once pinpointed, generic mappings can be stored in a repository, so that it is possible to use them for a new scenario. In our example, given SC and G, the generic H mapping ΣB can be retrieved from the repository and immediately applied. There are three main challenges in this approach. - We need a clear characterization of what it means for a generic mapping to correctly describe and capture the semantics of an original schema mapping. - As a generic mapping is characterized by a combination of conditions on at- tribute names and roles, for a given schema mapping there is a combinatorial number of generic mappings. We need a mechanism to generate them. - For a new scenario (e.g., new schemas), there is an overwhelming number of generic mappings that potentially apply, with different levels of “suitability”. We need efficient tools to search through and choose among them. Fig. 2. The architecture of GAIA. In this work, we address the above challenges with GAIA, a system for mapping reuse. GAIA supports two tasks, as shown in Figure 2: (1) infer generic map- pings, called meta-mappings, from input schema mappings, and store them in a repository; (2) given a source and a target schema, return a ranked list of meta- mappings from the repository which are used to generate possible mappings between these schemas. GAIA provides the following key contributions: – The notion of fitness: a semantics to precisely characterize and check when a meta-mapping is suitable for a reuse scenario. – An algorithm to infer meta-mappings from schema mappings with an approach that extends previous efforts for the definition of schema mappings by example; this algorithm is used to populate a repository of meta-mappings supporting schema mapping reuse. – An approach to reuse based on: (i) the search, in the repository of available meta-mappings, for those that fit a new pair of source and target schemas and (ii) the construction, from the retrieved meta-mappings, of possible mappings to be proposed to the designer. Because of space limitation, algorithms and details are in the full version of the paper [2]. In the rest of this paper, we provide examples of meta-mappings (Section 2) and experimental results from an evaluation of our system with more than 20,000 real-world data transformations over 40,000 schemas (Section 3). 2 Mapping and meta-mappings We recall the notion of schema mapping [6] and introduce that of meta-mapping. While the former notion models specific transformations, the latter introduces an abstraction over mappings [7, 8] and models generic mappings between schemas. Building on these notions, we illustrate the functionalities of our system. Schema mappings. Let S (the source) and T (the target) be two relational schemas and let Inst(S) and Inst(T) denote the set of all possible instances of S and T, respectively. A (schema) mapping M for S and T is a binary relation over their instances, that is, M ⊆ Inst(S) × Inst(T) [3]. Without loss of generality, we consider mappings expressed by a single source- to-target tuple-generating-dependency (st-tgd) σ: ∀x(φ(x) → ∃yψ(x, y)) where x and y are two disjoint sets of variables, φ(x) (the left-hand-side, LHS) is a conjunction of atoms involving relations in S and ψ(x, y) (the right-hand- side, RHS) is a conjunction of atoms involving relations in T. The dependency represents a mapping M in the sense that (I, J) ∈ M if and only if (I, J) satisfies σ. In this case, J is called a solution of I under σ. We can compute a suitable J in polynomial time by applying the chase procedure to I using σ [6]: the result may have labeled nulls denoting unknown values and is called the universal solution, since it has a homomorphism to any possible solution J 0 , that is, a mapping h of the nulls into constants and nulls such that h(J) ⊆ J 0 . Example 4. Consider the schemas SA , SB and G of Figure 1, which we recall in Figure 3 with all the formalisms that will be discussed throughout the paper. A mapping between SA and G is the st-tgd σA discussed in the Introduction and reported in Figure 3 (quantifiers are omitted for the sake of readability). schemas SA = {RA (Firm, Gains, Zone, Sector ), Activity(Code, Description), RA .Sector 7→ Activity.Code} G = {Balance(Company, Gains, Zone, Sector )} SB = {RB (Enter prise, Gains, Sector , Capital, Area), Location(Code, Name), RB .Area 7→ Location.Code}} mappings σA : RA (f , g, z , s), Activity(s, d) → Balance(f , g, z , d). σB : RB (e, g, s, c, a), Location(a, n) → Balance(e, g, n, s). m-schemas Att Rel Key name in FKey name name in SA Gains RA name in refer RA Firm RA Zone RA Sector RA Activity Activity Code Activity Description Activity Att Rel Key name in FKey name name in Gains RB SB name in refer RB Enterprise RB Sector RB Area RB Location Location Code Location Capital RB Name Location Att Rel Key name in G name name in Gains Balance Balance Company Balance Zone Balance Sector Balance meta-mappings ΣA : Rel(R),Key(K1 , R),Att(A1 , R),Att(A2 , R),FKey(F, R, S),Rel(S),Key(K2 , S),Att(A3 , S) → Rel(T ),Key(K1 , T ),Att(A1 , T ),Att(A2 , T ),Att(A3 , T ) qS (x) = Rel(R), Key(K1 , R), Att(A1 , R), Att(A2 , R), Att(A3 , R), FKey(F, R, S), Rel(S), Key(K2 , S), Att(A4 , S) qT (x, y) = Rel(T ), Key(K1 , T ), Att(A1 , T ), Att(A2 , T ), Att(A4 , T ) ΣB : qS (x) → qT (x, y) P ΣB : qS (x), A1=Gains, A2 =Sector , A3 =Capital → qT (x, y) N ΣB : qS (x), A16=Sector , A1 6=Capital, A2 6=Gains, A2 6=Capital, A3 6=Gains, A3 6=Sector → qT (x, y) H ΣB : qS (x), A1=Gains, A2 6=Gains, A2 6=Capital, A3 6=Gains, A3 6=Sector → qT (x, y) Fig. 3. Schemas, m-schemas, mappings, and meta-mappings discussed along the paper. Intuitively, the application of the chase to the instance of SA using σA enforces this dependency by generating one tuple in the target for each pair of tuples in the source for which there is a binding to the LHS of the dependency. The result includes the first two tuples in relation Balance in Figure 1. Besides, a mapping from SB to G is represented by the s-t tgd σB in Figure 3. Meta-mappings. A meta-mapping describes generic mappings between rela- tional schemas and is defined as a mapping over the catalog of a relational database [8]. Specifically, in a relational meta-mapping, source and target are both defined over the following schema, called (relational) dictionary: Rel(name), Att(name, in), Key(name, in), FKey(name, in, refer ) (for the sake of simplicity, we consider here a simplified version of the relational model). An instance S of the dictionary is called m-schema and describes relations, attributes and con- straints of a (standard) relational schema S. Figure 3 shows the m-schemas of the schemas SA , SB , and G of the running example. We assume, hereinafter, that, given a schema, its corresponding m-schema is also given, and vice versa. A meta-mapping is expressed by means of an st-tgd over dictionaries that describes how the elements of a source m-schema map to the elements of a target m-schema. Example 5. Mapping σA of Example 4 can be expressed, at the dictionary level, by meta-mapping ΣA in Figure 3. This st-tgd describes a generic transformation that takes two source relations R and S linked by a foreign key F and generates a target relation T obtained by joining R and S on F that includes: the key K1 and the attributes A1 and A2 from relation R and the attribute A3 from S. Given a source m-schema S and a meta-mapping M , a target m-schema T is generated by applying the chase procedure to S using M . Example 6. The chase of SA using ΣA , both in Figure 3, generates the following target m-schema where ⊥R is a labelled null denoting a relation name. Att Rel Key name in name name in Gains ⊥R ⊥R Firm ⊥R Zone ⊥R Description ⊥R This m-schema describes the relational schema: R(Firm, Gains, Zone, Description) A meta-mapping operates at schema level rather than at data level and thus pro- vides a means for describing generic transformations. Subtleties could arise from the chase procedure in presence of existential quantifications in meta-mappings producing duplications of relations in the result. This is avoided by assuming that, for each existentially quantified variable, there is a target equality gener- ating dependency (egd) [6] ensuring that whenever two relations in the target have the same structure, then they coincide. From meta-mappings to mappings. Given a source schema S and a meta- mapping Σ, it is possible not only to generate a target schema by using the chase, as shown in Example 6, but also to automatically obtain a schema mapping σ that represents the specialization of Σ for S and T [8]. The schema to data exchange transformation (SD transformation) generates from S and Σ a com- plete schema mapping made of S, a target schema T (obtained by chasing the m-schema of S with the meta-mapping), and an s-t tgd σ between S and T. The correspondences between LHS and RHS of σ are derived from the provenance in- formation computed during the chase step, in the same fashion as the provenance computed over the source instance when chasing schema mappings [4]. Example 7. Consider again the scenario in Figure 3. If we apply the SD trans- formation to the schema SA and the meta-mapping ΣA , we obtain the target m-schema of Example 6 and the following mapping from SA to ⊥R : σ : RA (f, g, z, s), Activity(s, d) → ⊥R (f, g, z, d). Thus, we get back, up to a renaming of the target relation, the mapping σA in Figure 3 from which ΣA originates. From mappings to meta-mappings While previous work focused on gen- erating a mapping from a given meta-mapping [8], we tackle the more general problem of mapping reuse, which consists of: (i) generating a repository of meta- mappings from a set of user-defined schema mappings, and (ii) given a new pair of source and target schemas, generating a suitable mapping for them from the repository of meta-mappings. Example 8. For the scenario in Figure 3, our system first generates several fitting meta-mappings from σA between SA and G. Once SB and G are given as input for a new transformation, the system scans the corpus of existing meta-mappings and identifies ΣA as a fitting meta-mapping from SB to G. This meta-mapping is then instantiated to generate a schema mapping between them, according to the SD transformation above. 3 Experimental evaluation We implemented GAIA in PL/SQL 11.2 for Oracle 11g. All experiments were conducted on Oracle Linux, with an Intel Core i7@2.60GHz, 16GB RAM. Datasets and Transformations. We used data transformations that are peri- odically executed to store data coming from several data sources into the Central National Balance Sheet (CNBS) database, an archive of financial information of about 40K enterprises. The schema of CNBS has 5 relations with roughly 800 at- tributes in total. Source data come from three different providers. The Chamber of Commerce provides data for 20K companies (datasets Chamber). While the schemas of these datasets are similar, the differences require one mapping per company between relations of up to 30 attributes with an average of 32 (self) joins in the LHS of the mappings. Data for 20K more companies is collected by a commercial data provider in a single database (CDP). This database is different both in structure and attributes names from the schemas in Chamber and requires one mapping involving 15 joins in its LHS. Finally, data for fur- ther 1,300 companies (Stock) is imported from the stock exchange, with each company requiring a mapping with 40 joins on average. Transformation scenarios. For the inference of the meta-mappings, we con- sider two configurations: single, where a meta-mapping is inferred from one map- ping, and multiple, where the inference is computed on a group of mappings for companies in the same business sector. We observe that the results stabilize with 10 schema mappings in the group and do not improve significantly with larger numbers, therefore consider 10 mappings for the multiple configuration. Search precision. Let σ be a mapping from a source S to a target T and let Q be the set of meta-mappings inferred from σ. We measure, in terms of search precision, the ability of the system to return the meta-mappings in Q when queried with S and T, i.e., how well the system retrieves correct cases. We use a resubstitution approach: we populate an initially empty repository by inferring the meta-mappings from a set Θ of mappings. For each schema mapping σ in Θ, we will denote by Qσ all the meta-mappings inferred from σ. We then pick a schema mapping σ 0 from Θ and search the repository by providing as input the source and target schemas of σ 0 . We then collect the top- 10 results according to the coverage and compute the percentage of correctly retrieved meta-mapping, that is, those that belong to Qσ0 . We test search precision with a corpus of mappings of increasing size (from 200 to 21,301). In each test, we query the repository of meta-mappings inferred from the given mappings by using 50 different source-target pairs and report their average search precision. The experiment is executed on single and mul- tiple configurations. The largest repository contains about 700K explicit meta- mappings in single configuration and 110K in multiple configuration. In the multiple scenario, the test is considered successful on a mapping σ when the retrieved meta-mapping originates from the group that includes σ. On average, a meta-mapping includes 12 equality and/or inequality constraints. The plot in Figure 4 shows that search precision decreases with the size of the repository as larger num- bers of mappings lead to an increas- ing number of false positives in the result. We report the search precision for the CDP mapping with a separate line. When this mapping is inserted in the repository (after 10K transfor- mations), it is immediately identified Fig. 4. Search precision. with both configurations. This shows that specific structures and constants lead to meta-mappings that are easy to retrieve. Meta-mappings derived from multiple schema mappings improve search precision because are more general than the ones from single mappings, thus reducing the risk of overfitting. References 1. D. Abadi et al. The Beckman report on database research. Commun. ACM, 59(2):92–99, Jan. 2016. 2. P. Atzeni, L. Bellomarini, P. Papotti, and R. Torlone. Meta-mappings for schema mapping reuse. PVLDB, 12(5):557–569, 2019. 3. P. A. Bernstein and S. Melnik. Model management 2.0: manipulating richer map- pings. In SIGMOD, 2007. 4. L. Chiticariu and W. C. Tan. Debugging schema mappings with routes. In VLDB, pages 79–90, 2006. 5. R. Fagin, L. M. Haas, M. A. Hernández, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema mapping creation and data exchange. In Conceptual Modeling, 2009. 6. R. Fagin, P. Kolaitis, R. Miller, and L. Popa. Data exchange: Semantics and query answering. In ICDT, 2003. 7. M. A. Hernández, P. Papotti, and W. C. Tan. Data exchange with data-metadata translations. PVLDB, 1(1):260–273, 2008. 8. P. Papotti and R. Torlone. Schema exchange: Generic mappings for transforming data and metadata. Data Knowl. Eng., 68(7):665–682, 2009.