Bi-Level Mapping: Combining Schema and Data Level Heterogeneity in Peer Data Sharing Systems

Bi-Level Mapping: Combining Schema and Data Level Heterogeneity in Peer Data Sharing Systems Md. MehediMdAnisurRahman mrahman@site.uottawa.ca Site University of Ottawa

800 King Edward Road Ottawa Canada

IlujuMasud mmasud@site.uottawa.ca Site University of Ottawa

800 King Edward Road Ottawa Canada

AbdulmotalebElKiringa kiringa@site.uottawa.ca Site University of Ottawa

800 King Edward Road Ottawa Canada

Saddik elsaddik@site.uottawa.ca Site University of Ottawa

800 King Edward Road Ottawa Canada

Bi-Level Mapping: Combining Schema and Data Level Heterogeneity in Peer Data Sharing Systems 0270F534938126DAA54660790B37054F GROBID - A machine learning software for extracting information from scholarly documents Model of Peer Data Sharing System Schema mapping Data mapping Query Translation

Peer data sharing systems use either schema-level or data-level mappings to resolve schema as well as data heterogeneity among data sources (peers). Schema-level mappings create structural relationships among different schemas. On the other hand, data-level mappings associate data values in two different sources. These two kinds of mappings are complementary to each other. However, existing peer database systems have been based solely on either one of these mappings. We believe that if both mappings are addressed simultaneously in a single framework, the resulting approach will enhance data sharing in a way such that we can overcome the limitations of the non-combined approaches. In this paper, we introduce a model of a peer database management system(PDMS) which uses a mapping that combines schema-level and data-level mappings. We call this new kind of mapping bi-level mapping. We present the syntax and semantics of bi-level mappings. We also provide a query evaluation procedure for the PDMS that uses the bi-level mappings.

Introduction

Designing a system for integrated access to distributed and heterogeneous information sources, e.g. federated database systems, distributed database systems, and peer database management systems is an important research area. The main goal of any data integration system [10,7,6,5,4,3,14] is to combine data residing at different sources, and to give users an amalgamated view of these data. In a federated database system, a global schema is defined against the data sources of all sites which gives a unique logical view to the global users for accessing the underlying data sources [2]. Users pose a query on the global schema and the query is decomposed into subqueries that are distributed to the respective sites. The distribution is based on the mappings and the query vocabularies. Generally, the mappings between the sources and the global schema are established by schema mappings. The underlying assumption is that there is no data-level heterogeneity among the sources. Several strategies are used for defining the schema mappings between the mediated and local schemas including, global-asview (GAV), local-as-view (LAV), and global-and-local-as-view (GLAV) [8]. In GAV Fig. 1. General scenario of P2P system approach, the mediated schema is described in terms of local sources. In LAV the local sources are described in terms of the mediated schema. GLAV is a combination of the two approaches where both GAV and LAV are used to integrate the mediated and local schema.

However, in a P2P system, there is no global schema, and it is not feasible to create a global schema for all the sources due to dynamic nature of peers. In addition, the heterogeneity among data sources in peers may be schema-level and data-level. A query in a P2P system is posed against the local schema and the query is translated for its neighbors based on the mappings [12]. In order to retrieve results of the query from the network, the query repetitively follows mappings through the network of peers until all relevant peers are visited [15]. When the query is executed in a peer, the partial result is returned to the query initiator. Finally, all the partial results are assembled to get an overall answer. During the execution of a query, a peer may play multiple roles. It may act as a data provider, a controller (acting like global component in a federated system), and a mediator (passing along queries without contributing to the result) [13].

A sample scenario of a peer-to-peer (P2P) system is depicted in Figure 1. As shown in figure, each peer has its own local source. The local source in a peer is designed independently during the creation of the local database of that peer. In order to provide a unique access view of local as well as remote data to the users, each peer defines a schema called peer schema. The result of a local query posed in a peer is produced from the local source of the peer. A peer also defines external sources that are used to access data from its acquainted peers. An external source is a view of a peer schema of an acquainted peer and is defined through some GLAV mappings called P2P mappings or peer mappings. Mapping tables may be used in P2P mappings for resolving data-level heterogeneity. Presence of mapping tables on the dotted lines (i.e. P2P mappings) in Figure 1 expresses the fact that, when data crosses the border between the source and destination peer, it is changed in the data-level using the mapping tables associated to the P2P mappings. Peer schema is defined in terms of the local sources and external sources by some GLAV [9,8] mappings. These mappings are called local mappings.

There are some advantages of using external sources instead of making direct link between the peer schemas. Firstly, it can tackle the dynamic nature of a P2P system with ease. Whenever a peer finds an acquaintance peer for the first time, it creates some external sources and links it both to its own peer schema and with the peer schema of the acquainted peer. Thus the external source can be treated by the target peer in the same way as local sources. When the source peer of an external source becomes unavailable in the network, it simply becomes an empty relation. Once an external source is created, any time the corresponding source peer becomes available it gets activated. Secondly, treating external sources as local sources, peer schema can be defined in a straight forward and autonomous way.

Since peers are fully autonomous to design their database and store data with their own format, heterogeneity among the peers may come in two forms: (i) schema-level and (ii) data-level. Notice that by creating the mappings through GLAV, the schemalevel heterogeneity can be resolved. Authors in [16] proposed such a scheme for creating mappings between peers that resolve schema-level heterogeneity. However, the GLAV approach is not sufficient to resolve the data-level heterogeneity among peers. On the other hand, mapping table [11], used for resolving data-level heterogeneity between peers, is not sufficient for resolving schema-level heterogeneity.

Motivating Example: Need For Bi-level Mappings

Consider two peers P 1 and P 2 in a P2P system as shown in Figure 2. Assume that peers store employee information to be shared with each other. P 1 has a local source Empl List(Id, Name, Position, salary). The attributes Id, Name, Position, and Salary represent identification number, name, position, and the salary of an employee, respectively. Similarly, P 2 stores its employee information by the local sources Employee(Id, Name, Jid) and Job Desc(Jid, Job Description). Attributes Jid and Job Description represent the job identification number and the title of the job, respectively. Now, assume that P 1 has an external source E(Name, Position) that illustrates that peer P 1 is interested in only the names and positions of employees stored in P 2 . In order to give a unique access view to the data stored in P 1 and P 2 , peer P 1 defines a peer schema P S1 which contains a single view N P (Name, Position) for its users. Similarly, peer P 2 creates its peer schema P S2 which contains two views P2E(Name, Jid) and P2J(Jid, Job Description). This schema is designed considering its local source and other external sources from other peers (not mentioned in the figure).

From the Figure 2, we also notice that P 1 has a mapping table mt that maps the data vocabularies of the attribute Job Description in source Job Desc of P 2 with the attribute Position in source Empl List. This mapping table is created since the two peers store job information using two different vocabularies. External source E in P 1 is defined as a view on the relations of the peer schema of P 2 . In the following, we illustrate different situations that may occur when a query is posed to a peer. The examples show the need for the bi-level mappings that this paper advocates for P2P systems.

Example 1 (Considering only schema mappings). Assume that the mapping table mt

is absent in peer P 2 . In this case, peer P 1 has only the schema-level mappings with P 2 . Suppose the query

q 1 : π N ame (σ P osition= CEO (N P ))

is posed at P 1 through its peer schema. Considering the mappings between N P and Empl List, q 1 is translated for the local source at P 1 as Fig. 2. Motivating example q 1 1 : π N ame (σ P osition= CEO (Empl List)). Moreover, using the mappings between N P and E, q 1 is translated for the external source at P 1 as q 1 1 : π N ame (σ P osition= CEO (E)). Based on the mappings between E and the peer schema at P 2 , query q 1 1 is translated as q 2 1 : π N ame (σ Job Description= CEO (P 2E P 2J)) which is finally translated according to the local vocabulary of P 2 as q 2 1 : π N ame (σ Job Description= CEO (Employee Job Desc)) Notice that the final result of the query q 1 is { Rameen} which is returned only from the local source at P 1 . If we consider 'CEO' and 'Chief Executive Officer' to be semantically equivalent then q 1 should extract 'Alina' from P 2 . Due to absence of data-level mappings, the query can not produce this result. Now assume that the mapping table mt exists. Hence, q 1 1 is translated for P 2 as q 2 1 : π N ame (σ P osition= CEO (Employee Job Desc mt)). In this case, we get more results for the query q 1 and the complete answer to this query becomes { Rameen, Alina} Example 2 (Considering only data mappings). In Figure 2 the external source E of P 1 is defined in terms of P 2E and P 2J of peer P 2 . P rojection and Join operators are used in that definition. Mapping tables are not expressive enough to express P rojection or Join. Schema mappings are needed for such association between two sources. So, a mapping is necessary that is capable of dealing with both the syntactical (schemalevel) and the semantic (data-level) heterogeneities at the same time.

In this paper,we present a new, generalized kind of mappings that combines both the schema-level as well as data-level mappings. We call these mappings bi-level mappings. We present semantics of the bi-level mappings and we give a model of a PDMS which allows bi-level mappings as a mean of bridging the heterogeneity gap between peers. If a user of a peer wants data from the local as well as from other peers in the network, she needs to pose the query on her local peer schema. We explain the query evaluation procedure by showing how the underlying bi-level mappings can be used to rewrite the query for execution in the local database and in remote ones.

The paper is organized as follows. Section 2 defines a model for a PDMS. In Section 3, we give the evaluation procedure of PDMS queries. Finally, in Section 4, we conclude with some future directions.

Model of a PDMS

A P2P system Π is defined by a pair < P, M >, where

P = {P 1 , P 2 , • • • , P n } is a set of peers and M = {M 1 , • • • , M n }

is a set of peer mappings. A set of mappings M j i ⊆ M i in P i defines the mappings between peer P i and P j . The construction of mappings M j i forms an acquaintance (i, j) between P i and P j . Suppose a P2P system Π =< P, M >. Formally, a peer P i ∈ P is a tuple, where

P i = (P S i , R i , L i , M i ).

Where, -P S i is the peer schema through which data in a peer is exposed to the external world. -R i is the set of sources comprised of local and external sources. We call it peer source or simply source. -L i is the set of GLAV local mappings which define the mappings between R i and P S i . Each local mapping, called mapping assertion (aka tuple generating dependency), in L i has the form ∀x(∃yϕ(x, y) ∃zψ(x, z))

where ϕ(x, y) and ψ(x, z) are conjunctive queries over the relations in R i and P S i respectively. -M i is a set of mappings, called bi-level mapping or peer mappings, that define the schema and data-level mappings between peers. Each mapping m ∈ M i is a pair < m S j,k , m D j,k >, where: -m S j,k is a GLAV mapping (practically GAV, since s(x) is always a single relation) of the form ∀x(∃yϕ(x, y) s(x)) where ϕ(x, y) is a conjunctive query over the peer schema of a peer P j and s(x) is the k th external source of P i . -m D j,k =MT={mt 1 , mt 2 , . . . , mt q } ⊆ M T i j is a set of mapping tables. M T i j denotes the set of mapping tables used to map data of P j to data of P i . m can alternatively be represented with the mapping assertion as follows:

∀x(∃yϕ(x, y) M T s(x))

We assumed that a curator with expertise in different domains is responsible for generating the mapping tables and the peer administrator maintains them in a peer. Schema mappings between two peers are initially created by the corresponding peer administrators when they agree to share data. Once created, the mappings are activated or deactivated depending on the presence of the corresponding peers in the network. Generating the mappings automatically is another research area and we did not address about this issue in this paper.

The semantics of M T is described in the following section.

Semantics of local mappings

For each peer P i , we introduce a first order logic (FOL) theory F i , called peer theory, where alphabet contains all the relation symbols in a peer schema P S i and the relations in R i . The axioms of F i include all the constraints of P S i and one logical formula representing each local mapping in L i . For a local mapping of the form ∀x(∃yϕ(x, y) ∃zψ(x, z)) in L i , a formula of the form ∀x(∃yϕ(x, y) ⊆ ∃zψ(x, z)) is added to F i . F i does not consider peer mappings. Thus, modeling a peer as a GLAV integration system becomes equivalent to modeling a FOL theory F i (ignoring the peer mappings in M i ).

Semantics of peer mappings

Similar to the local mappings, the semantics of peer mappings can also be given in terms of FOL. However, to incorporate the mapping tables, we will use notations and definitions provided below. Definition 5 (Augmentation f unction τ ). Let x be a tuple whose schema contains the attributes P . An augmentation function τ (x, P , Q, q) returns a tuple x , where x is exactly like x except the schema of x has the attributes Q in place of P and x [Q] = q. Definition 6. Let mt[P , Q] be a mapping table and x be a tuple, mt[x] denotes a set of tuples obtained by replacing the values of P attributes of x by the corresponding mapped values of Q in mt. Formally,

mt[x] ≡ {τ (x, P , Q, q) | q ∈ M ap(mt, x[P ])} Let M T = {mt 1 [P 1 , Q 1 ], mt 2 [P 2 , Q 2 ], . . . , mt n [P n , Q n ]

} be a set of mapping tables, where for any pairs of mapping tables

(mt i [P i , Q i ], mt j [P j , Q j ]), mt i = mt j =⇒ P i = P j , then M T [x]

denotes a set of tuples resulted from transformation of x by all the member mapping tables of M T . Formally,

M T [x] ≡ {τ (. . . τ (τ (x, P 1 , Q 1 , q 1 ), P 2 , Q 2 , q 2 ) . . . , P n , Q n , q n )| q 1 ∈ M ap(mt 1 , x[P 1 ]) ∧ . . . ∧ q n ∈ M ap(mt n , x[P n ])}

We have overloaded [ ] in many of our definitions; its meaning, however, will be clearly understood from the context. Definition 7. Let x be a tuple. Schema(x) returns the schema of that tuple, i.e. it returns the set of attributes whose values constituted the tuple. Schema() can be overloaded by providing its parameter as a relation/ view. In this case, it would return the schema of the relation. Now we define the semantics of peer mappings. We already mentioned that a mapping assertion of a peer mapping is of the form:

∀x(∃yϕ(x, y) M T s(x))

Let us assume that the above assertion defines an external source s of peer P i in terms of the peer schema of peer P j . We say that an interpretation of the schema of P i and P j satisfies the assertion if that interpretation satisfies the following FOL formula

∀x∀z(∃y(ϕ(x, y) ∧ z ∈ M T [x]) ≡ s(z))

We can interpret a mapping as a definition of how the data of the external source would be instantiated by the data of other peers. The formula also tells us that before instantiating the external source, data is converted using the corresponding mapping tables of M T . However, if there is no data-level heterogeneity, no mapping table is needed. In that case, an empty mapping table φ is used in the assertion. In that case, a peer mapping is represented as ∀x(∃yϕ(x, y) φ s(x)) which satisfies the FOL formula ∀x(∃yϕ(x, y) ≡ s(x)).

Example 3. Let us modify the peer mapping of Figure 2 by a bi-level mapping assertion m 1 which is expressed as follows.

π 1,4 (P 2E 2=3 P 2J) {mt} E

The new scenario is shown in Figure 3(a). To satisfy the assertion, the following formula has to be satisfied.

∀rtt (∃s(P 2E(r, s)

∧ P 2J(s, t) ∧ (r, t ) ∈ mt[(r, t)]) ≡ E(r, t ))

Given the source database in Figure 3(a), for satisfying the above formula, the extension of the intensional source E(N ame, P osition) has to be as shown in Figure 3(b). Figure 3(b) also shows the ultimate extension of the intentional relation N P (N ame, P osition) in the peer schema of peer P 1 . Consequently, in response to the query q 1 : π N ame (σ P osition= CEO (N P )) to peer P 1 , the PDMS will return the result {Rameen, Alina}.

Semantics of a P2P system

We give the semantics of a P2P system Π in terms of a set of models that satisfy the local and peer mappings of Π. Let a source database D for Π be a disjoint union of a set of local databases in each peer P i of Π. Given a source database D for Π, the set of models of Π relative to D is: Given a query q of arity k posed to a peer P i of Π, and a source database D, the certain answers to q relative to D are ans(q, Π, D) = {t|t ∈ q I , for every I ∈ sem D (Π)}

Query Evaluation

We adopt the gossiping mechanism for query execution [1]. When a query is posed to a peer it is executed in the local database of the peer and is forwarded to the acquaintances of the current peer. Whenever a peer gets a query forwarded by another peer, it executes and forwards the query to it's acquaintances causing in turn the further propagation of the query. This process continues until all the reachable peers have been processed or a fixed number of propagations of the initial query has occurred. Considering how a query propagates through the peers of a peer network, the peer network can be converted to a shortest path spanning tree. We call this tree a propagation tree. Figure 4(a) shows a peer network and the corresponding query propagation tree with respect to peer P 1 where the query originates. A single peer is visited only once per execution of a query, i.e. in the propagation tree of the query a peer can appear at most once. Note that the propagation tree is constructed dynamically and depends on how the peers are acquainted to one another. When a peer forwards the query to one of its acquaintance, the former is said to be the ancestor of the later in the tree. We assume that each query is defined w.r.t. the schema of a single peer (called initial peer). The initial peer executes the query in a straight forward fashion and also propagates it to its acquainted peers. At the time of propagation, the query is transformed to get compatible with the peer schema of the acquaintance peer. The local execution of the query and it's transformation and propagation to other peers goes on parallelly. The transformation of the query is unfolding the definition of the external sources defined in the peer mappings between two peers. Each peer in the propagation path gets the query result from it's descendant, merges the result with it's own result, and then back propagates to it's ancestor. When a result is back propagated to a peer, it is transformed according to the peer schema of the recipient peer, so that it can be merged with the result of the local result. Merging two results is done by simply taking the inner union of them. Since the results may need some semantic translation, instead of directly returning them to the initial peer, they are returned along the reverse way of query propagation. We borrowed the concept of local query and global query from [12]. A local query is executed using the data in the local peer. On the other hand, a global query uses the peer network to get the amalgamated result of the locally retrieved data (i.e. the result of the local queries). We now formalize the above notions. Consider a P2P system Π = (P, M) with P = {P 1 , P 2 , ..., P n }. Assume I i be an instance of P i . Let T ranQ(q, M j i ) be a function that translates the query q of peer P i to the vocabulary (both data and schema ) of peer schema of P j according to the peer mappings M j i between P i and P j . Now a global query q P i with respect to a specific query q i posed on the peer P i is defined as a set of queries {q 1 i , q 2 i , . . . , q n i , } where q j i is defined as follows:

q j i =                                qi If i = j T ranQ(q i , M j i ) If i = j

and there exists a mapping M j i between peers P j and P i T ranQ(q k i , M j k ) If i = j and P i is indirectly mapped to Pj through some intermediate peers and P k be the immediate predecessor of P j in the propagation path Let T ranV (V, M j i ) be a function that translates the view V of the peer P i to the vocabulary of the peer P j using the peer mapping M j i . Moreover, let q j i (I j ) denotes the view resulted from application of query q j i to the local instance I j of peer P j . Semantics of q j i is given above. The result of the global query Result(q P i ) is defined as Result(q P i ) = ∪ n j=1 I i i,j . Where I k i,j is defined as follows: Since all the local views are ultimately transformed according to the schema of the initial peer P i , an inner union can be taken on the translated results. Our approach differs from the approach of [12] where outer union is taken among the local views of peers which provides far less meaningful information. Obviously, computation time can be saved if two or more mappings can be composed together to generate a direct mapping between two peers. We will address this issue in a separate work.

I k i,j =                                q j i (I j ) If j = k T ranV (q j i (Ij), M i j ) If i = k = j

Example 4. Consider the example in Figure 4(b). A query q 1 , posted against the peer schema of P 1 , is as follows

q 1 : π N ame (σ Designation= CEO (P 1E))

results in a global query q P 1 = {q 1 1 , q 2 1 , q 3 1 } where: q 1 1 : π N ame (σ Designation= CEO (P 1E)) q 2 1 : π N ame (σ P osition= CEO (P 2E)) q 3 1 : π N ame (σ P osition= CEO (P 3E P 3J mt)) The queries q 1 1 ,q 2 1 , q 3 1 are executed on the local instances of the peers P 1 , P 2 , and P 3 , respectively to produce the results I 1 ,q 2 1 and q 3 1 Peer P 3 converts I 3 1,3 to I 2 1,3 using the mapping M 2 3 and sends it to peer P 2 . P 2 converts I 2 1,2 and I 2 1,3 to I 1 1,2 and I 1 1,3 , respectively using the mapping M 1 2 and sends it to peer P 1 . Note that mapping M j i is the inverse of mapping M i j . In Figure 4(b), all the mappings are not shown due to lack of space. The attribute Name is common in all the peers and so the converted views remains the same as the original views. P 1 combines the views I 1 1,1 , I 1 1,2 , and I 1 1,3 to produce the final result as {Anwar, Rameen, Alina}.

Conclusion

We have considered the problem of data sharing among heterogeneous data sources. The heterogeneity may come from differences in the schemas and/or from presenting the same (or semantically related) data with different vocabulary. Schema-level mapping can resolve the former while data-level mapping can be used for the later type of heterogeneity. In P2P systems, practically in most of the cases, both type of heterogeneity appears simultaneously. We present a peer mapping between two peers, called bi-level mapping, which can address both schema-level and data-level mappings at the same time. We give the semantics of bi-level mappings. Then we define a P2P system as a model that satisfies the local mappings as well as the bi-level mappings.

We did not take into account the local constraints that may not be satisfied by the external sources and consequently make the model inconsistent. Presence of cycles during evaluation of queries is another important issue that we did not address. Future research includes resolving these issues and finding algorithms for composition of the bi-level mappings.

Fig. 3 .3Fig. 3. Bi-Level Mapping Example

Fig. 4 .4Fig. 4. Query propagation tree and Merging example

Fig. 5 .5Fig. 5. Local results of q 11 ,q 2 1 and q 3

PlanetP: using gossiping to build content addressable peer-to-peer information sharing communities FMCuenca-Acuna CPeery RPMartin TDNguyen 2003 HPDC Federated database systems for managing distributed, heterogeneous, and autonomous databases APSheth JALarson ACM Comput. Surv 22 3 1990 The Clio Project: Managing Heterogeneity RJMiller MHernndez LMHaas LYan CTHoward RFagin LPopa 2001 SIGMOD Record AutoMed: A BAV Data Integration System for Heterogeneous Data Sources MBoyd SKittivoravitkul CLazanitis PJMcbrien NRizopoulos 2004 CAiSE Information Integration Using Logical Views DAMaluf AAshish 2005 SIGMOD Information Integration Using Logical Views JDUllman ICDT 1997 The TSIMMIS Project: Integration of Heterogeneous Information Sources SChawathe HGarcia-Molina JHammer KIreland YPapakonstantinou JUllman JWidom 1994 IPSJ Data Integration: A Theoretical Prespective MLenzerini 2001 PODS Logical Foundations of Peer-To-Peer Data Integration DCalvanese GDGiacomo MLenzerini RRosati 2004 PODS The Hyperion Project: From Data Integration to Data Coordination MArenas VKantere AKementsietsidis IKiringa RJMiller JMylopoulos 2003 SIGMOD RECORD Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues AKemensietsidis MArenas ACM SIGMOD 2003 Data sharing through query translation in autonomous sources AKemensietsidis MArenas RJMiller 2004 VLDB Self-Extending Peer Data Management RHeese FNaumann ARoth BTW 2005 Querying Heterogeneous Information Sources Using Source Descriptions AYLevy ARajaraman JJOrdille 1996 VLDB Answering queries using views: A survey AHalevy VLDB Journal 10 2001 The Piazza Peer Data Management System AYHalevy ZGIves JMadhavan PMork DSuciu ITatarinov IEEE T.K.D.E 16 2004