Introduction

Fagin, R., Kolaitis, P. G., Popa, L., Tan, W. C.: Composing schema mappings: Second-order dependencies to the rescue., ACM Trans. Database Syst.

XML Schema Mappings in the Presence of Key Constraints and Value Dependencies ?

Institute of Control

Information Engineering

0 Faculty of Mathematics and Computer Science, Adam Mickiewicz University , Poznan

2006

30 4

Schema mappings play a central role in both data integration and data exchange, and are understood as high-level speci¯cations describing the relationships between data schemas. Based on these speci¯cations, data structured under a source schema can be transformed into data structured under a target schema. During the transformation some structural constraints, both context-free (the structure) and contextual (e.g. keys and value dependencies) should be taken into account. In this work, we present a new formalism for the schema mapping speci¯cation. We propose a new class of tree-pattern formulas in order to extend semantics of XML schema mappings by speci¯cation of key constraints and value dependencies. We discuss foundations of the method and propose a key-preserving transformation algorithm.

Introduction

Many problems in information integration and data exchange need schema mappings, which are high-level speci¯cations describing the relationships between source and target data schemas. A schema mapping over a pair of schemas expresses a relation on the set of instances of these two schemas. The mapping should produce an instance of the target schema for a given instance of the source schema. The target instance should correctly represent the information contained in the source instance under the constraints imposed by the target schema [ 3, 11, 19, 29 ]. The problem of schema mapping arises in many areas of data management systems. Recently, this problem has received considerable attention in the context of data exchange [ 3, 10 ], data integration [16], schema evolution [18], P2P databases [ 5, 20 ], life science databases [21] or e-commerce [ 6 ], where data comes from many di®erent sources with di®erent schemas. In data management [18] such operations on mappings as composition [ 17, 11, 19 ] and inversion [ 9 ] has been considered.

Schema mappings between relational schemas are usually de¯ned by means of the source-to-target dependencies [ 1, 12, 19 ]. Recently, this method was adapted ? The work was supported in part by the Polish Ministry of Science and Higher Education under Grant N516 015 31/1553. to XML data and tree-patterns are used to capture the context-free structure of XML trees [ 3 ]. However, the approach does not take (contextual) key constraints into account.

In this paper we assume that the de¯nition of a schema is speci¯ed in XSD (XML Schema De¯nition [27]) language, and except for context-free constraints the following three classes of contextual constraints are taken into account: keys, key references, and value dependencies. 1. Key constraints state that a subtree is uniquely identi¯ed by a tuple of values of key paths [ 4, 27 ]. They are speci¯ed within the <xs:key> elements. 2. Keyref constraints assert a correspondence of the tuples of values resulting from evaluating a referencing key (a foreign key) with those of the referenced key [27]. They are speci¯ed within the <xs:keyref> elements. 3. Value dependency constraints impose that a text value of a text node depends on a tuple of values of other nodes (resembling functional dependencies in relational databases [ 1 ]). The value dependencies are declared in the <xs:valdep> section of XSD (this section can be included in <xs:annotation> element of XSD).

We introduce key-pattern formulas which are intended to represent keys de¯ned within XSD, and are used to control mapping execution. It makes the process of XML data mappings and transformations semantically richer. We propose algorithms that construct an intended key-preserving target instance, we also formulate a theorem that is the basis for the presented method. In our approach, we distinguish among three kinds of mappings: automappings which are automatically generated from XSD speci¯cations and capture both context-free and contextual properties; c-mappings that express correspondences between source and target schemas; t-mappings which are derived from auto- and c-mappings and de¯ne key-preserving transformations. The advantage of this approach is that the management of these mappings and the reasoning over them can be carried out independently.

The structure of the paper is as follows. In Section 2 formal de¯nitions concerning XML trees and XML schemas are given and a running example is presented. The fundamentals of schema mappings are reviewed in Section 3. In this section we also characterize the existing approaches. In Section 4 we propose a formalism for de¯ning key-preserving XML schema mappings. The theorem and the algorithm concerning the mapping execution are presented in Section 5. Section 6 concludes the work and formulates directions of the future research. 2

XML trees and XML schemas

We consider XML data as a node-labeled unranked and ordered tree (XML tree) [28]. For simplicity, attribute nodes are represented by text nodes.

Let L ½ Lab be a ¯nite set of labels, Str be a set of string values (with the distinguished null values ?; ?1; ?2; ::: in Str), and DId be a set of document identi¯ers (e.g. URI addresses).

De¯nition 1. An XML tree I is a tuple (r; N e; N t; ·; child; ¸; º; ¾), where: { r is a distinguished root node, N e is a ¯nite set of element nodes, and N t is a ¯nite set of text nodes; { · { a total ordering relation on the set of nodes; { child µ (frg [ N e) £ (N e [ N t) { an acyclic binary relation introducing tree structure into (r; N e; N t) such that the root has only one child (this child is the top element) and each element node must have a child; { ¸ : N e ! L { a function labeling element nodes (the label is the type of the node); { º : N t ! Str { a function labeling text nodes with their text values; { ¾ : frg ! DId { a function that assigns a document identi¯er to the root. ¤

The XML trees will be considered as instances of XML schemas. We will restrict ourselves to context-free and contextual (key) constraints speci¯ed by an XML schema. The context-free constraints are given by regular expressions describing types of children for any element node type. Keys specify how subtrees are identi¯ed within an XML tree by means of text values of key paths.

A path P is de¯ned by the grammar P ::= =l j ==l j l j P=l j P==l; where: =l denotes a set of children of type l of the root, ==l denotes all descendent nodes of type l of the root, P=l denotes a set of nodes of type l which are children of the nodes denoted by P , P==l denotes all "descendent-or-self" nodes of type l of the nodes denoted by P .

We assume that there is a key for any node type. Following [ 4 ], we assume the following de¯nition of a key for XML: De¯nition 2. A key is an expression of the form (P; (P 0; (P1; :::; Pk))); where every P=P 0=Pi is a valid path. The path P is the context path, P 0 is the target path, and P1; :::; Pk are key paths of the key. ¤ De¯nition 3. An XML schema S over a set L of labels is a tuple

S = (top; ContF ree; KeyDep; Keyref Dep; V alDep); where: { top is the the distinguished label of the top (outermost) element; { ConF ree is a function from L to regular expressions over L ¡ ftopg de¯ned by the grammar e ::= ² j l j eje j e; e j e? j e+ j e¤; { KeyDep assigns a key (P; (l; (P1; :::; Pk))) to any label l 2 L ¡ ftopg. { KeyRef Dep assigns referential constraints, i.e. expressions of the form, (P; l; (P1; :::; Pk)) ref KeyDep(l0), to some labels in L; l; l0 2 L ¡ ftopg. The expression de¯nes a foreign key (P; l; (P1; :::; Pk)) that refers to a primary key KeyDep(l0). { V alDep assigns a set of value dependencies to some labels in L. A value dependency is an expression of the form (P=l; P 0) = f (P=l=P1; :::; P=l=Pm). De¯nition 4. An XML tree I = (r; N e; N t; ·; child; ¸; º; ¾) conforms to the XML schema S = (top; ConF ree; KeyDep; KeyRef Dep; V alDep), if: 1. ¾(r) is de¯ned, and for any text node n 2 N t, º(n) is de¯ned. 2. If (r; n) 2 child, then ¸(n) = top. 3. For any node n in N e with children (n1; :::; nm) such that n1 < ::: < nm, if ¸(n) = l, then the sequence ¸(n1):::¸(nm) is a word of the language de¯ned by the regular expression ConF ree(l). 4. If KeyDep(l) = (P; (l; (P1; :::; Pk))), then each subtree rooted in a node n of type l in an context denoted by the context path P , is uniquely identi¯ed by a tuple of text values (v1; :::; vk) of the key paths (P1; :::; Pk); k ¸ 0. For k = 0, (P; (l; ())) indicates that n is unconditionally unique in the context of P . 5. If KeyRef Dep(l) = (P; l; (P1; :::; Pk)) ref (P 0; (l0; (P10; :::; Pk0 ))), then for any tuple (v1; :::; vk) of values such that P [l[P1 = v1 ^ ¢ ¢ ¢ ^ Pk = vk]] there exists exactly one node n 2 [[P 0=l0]] such that the tuple (P10; :::; Pk0 ) of key paths on n has the value equal to (v1; :::; vk). 6. If (P=l; P 0) = f (P=l=P1; :::; P=l=Pn) 2 V alDep(l), then text values of the path P=l=P 0 are functionally dependent on the set of tuples of text values determined by (P=l=P1; :::; P=l=Pn). We use the name f to distinguish two di®erent dependencies having the same set of determining paths. ¤ An XML tree I is called an instance of a schema S, denoted I j= S, i® conforms to S, i.e. satis¯es all the constraints imposed by S.

Example 1. Sample XML trees and XML schema trees are given in Fig. 1. The XML trees in Fig. 1 represent the bibliographical data. The node labels are as follows: paper (P ) and title (T ) of the paper; author (A), name (N ) and the a±liation (U ) of the author; year (Y ) of publication and the conference (C) where the paper was presented; R and K are used to join authors with their papers.

If we restrict ourselves to the context-free structure only, then de¯nitions of S1, S2 and S3 can be given as DTDs in Fig. 2. In order to specify contextual constraints, we will use the schema de¯nition language XSD. In Fig. 4 the complete XSD de¯nitions for S1 and fragments for S2 and S3 are given. It can be seen that the XML tree I1 is an instance of S1, whereas I2 and I20 are instances of S2 with respect to the DTDs D1 and D2, respectively. Moreover, I1 and I2 are instances of schemas speci¯ed by XSDs SD1 and SD2, since they satisfy keys, respectively: (1) (2) (S2; (A; (N; P=T ))); (S2=A; (N; ())); (S2=A; (U; ())); (S2=A; (P; (T ))); (S2=A=P; (T; ())); (S2=A=P; (Y; ())): For example, each subtree in I 20 denoted by =S2=A is uniquely identi¯ed by a pair of text values of the pair of key paths (N; P =T ). In Figure 3 there are constraints assigned to elements of the schema S3. ¤

I1 : S1 T t1

A N U N U a1 u1 a2 u2

S1: S1

T t2 T

A+ P* N

P A N a1 U?

A N U P a1 u1

I2: S2 T Y T Y t1 ⊥1 t2 ⊥2

S2: S2 N

A* U?

P+

T Y? P

N U a2 u2

P T Y t1 ⊥1

A N U P a1 u1

I2’: S2

A N U P a1 u1

S3: S3 T Y t1 ⊥1

T Y t2 ⊥2 A*

P+ N

R* K

Y? C?

A N U P a2 u2

T Y t1 ⊥1 source instances: let S be a source schema, T a target schema and M a schema mapping, i.e. a formula expressing a relationship between S and T . For a given instance I of S, ¯nd an instance J of T such that hI ; J i j= M. Such an instance J is called a solution for I under M.

In the relational data exchange setting, source-to-target dependencies (STDs) [ 1 ] are usually used to express schema mappings [12, 19]. The STD is a ¯rst-order formula: STD):

8x(©(x) ) 9yª (x; y)); that after skolemization has the form of the following second-order STD (SO 9f 8x(©(x) ^ Â(x; y) ) ª (x; y)); where: (1) f is a vector of function symbols, x, y are vectors of variables; (2) © is a conjunction of atoms over source schemas; (3) Â(x; y) is a conjunction of equalities of the form t = t0 where t and t0 are terms over f , x and y; (4) ª is a conjunction of atoms over target schema; (5) each variable is safe.

In [ 11 ] it was shown that SO STDs are strictly more expressive then STDs and are closed under composition. It means that the composition of two SO STD mappings, M13 = M12 ± M23, is also a SO STD mapping (this is not true for ¯rst-order STD mappings). The semantics of the composition of schema mappings is de¯ned by means of the composition of two binary relations.

For schemas containing nested data (such as XML schemas) extensions of the above mentioned formalism were proposed [22, 29, 13]. In [ 3 ], tree-pattern formulas [ 2 ] are used in STDs instead of the relational atoms. Further on, such SO STDs with tree-pattern formulas will be referred to as SO XSTDs (second-order XML source-to-target dependencies). In this way correspondences between XML data can be expressed. However, such mappings do not take key constraints into consideration and they are not capable of ensuring that the target instance conforms to a target schema with key speci¯cation. Speci¯cation in [ 3 ] is restricted to DTD schemas only. To illustrate this problem suppose that we want to restructure the instance I1 (Fig. 1) of D1 under D2 (Fig. 2).

The following XSTD formula speci¯es a mapping from D1 to D2: M := 8xT ; xN ; xU (=S1[P [T = xT ^ A[N = xN ^ U = xU ]]]

) 9xY =S2[A[N = xN ^ U = xU ^ P [T = xT ^ Y = xY ]]]):

Using the speci¯cation, two possible target instances, I2 and I20 (Fig. 1) can be produced. These instances conform to D2. However, I2 and I20 conforms to di±cult sets of keys { given in (2) and (3), respectively. In [ 3 ], a repair process is also proposed which allows for inventing null values in the ¯nal XML tree and to state that some null values should be equal.

In the next sections we will show how both key constraints and value dependencies can be captured by mappings. 4

Key-preserving XML schema mappings In this paper we propose key-pattern formulas, which are a kind of tree-pattern formulas intended to re°ect keys in the target schema. A key-pattern formula <schema xmlns="http://www.w3.org/2001/XMLSchema"<>schema xmlns="http://www.w3.org/2001/XMLSchema"> <element name="S1"> <element name="S2"> <complexType> <complexType> <sequence> <sequence>

<element ref="P" minOccurs="0" <element ref="A" ... /> maxOccurs="unbounded" /> </sequence> </sequence> </complexType> </complexType> </element> </element> <element name="A"> <element name="P"> ...

<complexType> </element> <sequence> ...

<element name="T" type="string" /> </schema> <element ref="A" minOccurs="1"

maxOccurs="unbounded" /> SD3: </sequence> </complexType> <key name="PKey"> <selector xpath="." /> <field xpath="T" /> </key> </element> <element name="A"> <complexType> <sequence> <element name="N" type="string" /> <element name="U" type="string"

minOccurs="0" /> </sequence> </complexType> <key name="AKey"> <selector xpath="." /> <field xpath="N" /> </key> <valdep> <target name="U" /> <function name="fu" /> <source xpath="N" /> </valdep> </element> </schema> <schema xmlns="www.w3.org/2001/XMLSchema"> <element name="S3"> <complexType> <sequence> <element ref="A" ..." /> <element ref="P" ..." /> </sequence> </complexType> </element> <element name="A"> ... <keyref name="AKeyref" refer="PKey"> <selector xpath="." /> <field xpath="R" /> </keyref> </element> <element name="P"> ... <key name="PKey"> <selector xpath="." /> <field xpath="K" /> </key> ...

</element> </schema> E ::= x j C ¼ ::= P [E] j ¼=P [E]

C ::= P = x j ¼ j C ^ C valuation !, where ! is a total function from the set of all variables occurring ¤ in ¼ to Str. We denote by !(x) the value of ! on x (we assume !(x) = ?, if there is not any text value for x in I { this is possible because of semistructured nature of XML).

Further on, by [[¼]] and n[[¼]] we will denote a value of ¼ according to semantics of XPath expressions [25, 14], i.e. a set of nodes reachable from the root or from the context node n, respectively, via ¼.

De¯nition 6. The satisfaction of a tree-pattern formula under a valuation ! in a node n of an XML tree I, is de¯ned as follows (val(n) is the text value of the text node n): 1. (I; n) j= (x)(!) i® val(n) = !(x). 2. (I; n) j= (P = x)(!) i® there is a node n0 2 n[[P ]] such that (I; n0) j= (x)(!). 3. (I; n) j= (C1 ^ C2)(!) i® (I; n) j= (C1)(!) and (I; n) j= (C2)(!). 4. (I; n) j= (P [E])(!) i® there is a node n0 2 n[[P ]] such that (I; n0) j= (E)(!). 5. (I; n) j= (¼=P [E])(!) i® (I; n) j= (¼)(!) and there is a node n0 2 n[[¼=P ]] such that (I; n0) j= (E)(!).

We say that a formula Á is satis¯ed in (I; n), or that (I; n) satis¯es Á, denoted (I; n) j= Á, i® (I; n) j= Á(!) for some valuation ! over Á into I. I j= Á denotes that Á is satis¯ed in the root of I. ¤ De¯nition 7. A key-pattern formula ± is a tree-pattern formula restricted to the syntax: ± ::= =top=l[E] j ±=l[E] E ::= x j C

C ::= P = x j C ^ C ¤ Lemma 1. Let ±=l[E] be a key-pattern formula, and I be an XML tree. Then: I j= ±=l[E] , I j= ± ^ 9n 2 [[±]]((I; n) j= l[E]): (5) Proof. The lemma follows from (4) and (5) in De¯nition 6.

The above lemma will be used in Section 5 as the basis of an algorithm creating key-preserving target instances. Key-pattern formulas can be automatically generated from keys de¯ned in a schema S = (top; ConF ree; KeyDep; KeyRef Dep; V alDep) over a set L of labels, by means of the following recursive algorithm: Algorithm 1 key-patt(key), generation of a key-pattern formula representing the given key.

Input : Schema S = (top; ConF ree; KeyDep; KeyRef Dep; V alDep) over a set L of labels, label l 2 L ¡ ftopg, key · = KeyDep(l) = (P; (l; (P1; :::; Pk))), k ¸ 0.

Output : Key-pattern formula key-patt(·). key-patt(·) = case of (·) (=top; (l; ())) : =top=l[x] (=top; (l; (P1; :::; Pk))) : =top=l[P1 = x1 ^ ¢ ¢ ¢ ^ Pk = xk] (P=l0; (l; ())) : key-patt(KeyDep(l0))=l[x] (P=l0; (l; (P1; :::; Pk))) : key-patt(KeyDep(l0))=l[P1 = x1 ^ ¢ ¢ ¢ ^ Pk = xk] endcase The automappings are identity mappings over schemas that describe how instances of the schema are transformed onto themselves. An automapping captures constraints speci¯ed within a schema by means of SO XSTD of the form: ¼(x) ^ Ã(x) ) ¢(x); (6) where: ¼(x) is a tree-pattern formula capturing the structure of the schema; Ã(x) is a conjunction of atoms of the form x = x0 and x = f (x1; :::; xn), where the former captures a key reference and the latter { a value dependence; ¢(x) is a conjunction of key-pattern formulas, where every formula represents a key in the schema. We assume that function symbols are quanti¯ed existentially, while variables { universally.

Example 2. Automappings of S1, S2 and S3, referring to XSDs in Fig. 4, have the following speci¯cations: 1. A1 := ¼1(xT ; xN ; xU ) ^ Ã1(xT ; xN ; xU ) ) ¢1(xT ; xN ; xU ), where { ¼1 := =S1[P [T = xT ^ A[N = xN ^ U = xU ]]] { Ã1 := xU = fU (xN ) { ¢1 := =S1=P [T = xT ] ^ =S1=P [T = xT ]=T [xT ] ^ =S1=P [T = xT ]=A[N = xN ] ^ =S1=P [T = xT ]=A[N = xN ]=N [xN ] ^ =S1=P [T = xT ]=A[N = xN ]=U [xU ] 2. A2 := ¼2(yT ; yN ; yU ; yY ) ^ Ã2(yT ; yN ; yU ; yY ) ) ¢2(yT ; yN ; yU ; yY ), where { ¼2 := =S2[A[N = yN ^ U = yU ^ P [T = yT ^ Y = yY ]]] { Ã2 := yU = fU (yN ) ^ yY = fY (yT ) { ¢2 := =S2=A[N = yN ] ^ =S2=A[N = yN ]=N [yN ] ^ =S2=A[N = yN ]=U [yU ] ^ =S2=A[N = yN ]=P [T = yT ] ^ =S2=A[N = yN ]=P [T = yT ]=T [yT ] ^ =S2=A[N = yN ]=P [T = yT ]=Y [yY ] 3. A3 := =S3[A[N = zN ^ R = zR] ^ P [K = zK ^ T = zT ^ Y = zY ^ C = zC ]] ^zR = zK ^ zY = fY (zT ) ^ zC = fC (zT ) ) =S3=A[N = zN ]^ =S3=A[N = zN ]=N [zN ] ^ =S3=A[N = zN ]=R[zR] ^ =S3=P [K = zK ]^ =S3=P [K = zK ]=T [zT ] ^ =S3=P [K = zK ]=Y [zY ] ^ =S3=P [K = zK ]=C[zC ] 4.2

C-mappings A c-mapping speci¯es the correspondence between source and target schemas, and states how patterns in the source tree correspond to patterns in the target tree. They are SO XSTDs of the form: ¼S(x) ^ ÁS;T (x; y) ) ¼T (x; y); (7) where: x and y are source and target variables, respectively; ¼S(x) is a treepattern formula over a source schema S and de¯nes source variables; ÁS;T (x; y) is a conjunction of atoms of the form x = x0, y = y0 and y = f (x1; :::; xn), where x = x0 and y = y0 are restrictions over variables, and y = f (x1; :::; xn) de¯nes a target variable by means of the source ones; ¼T (x; y) is a tree-pattern formula over a target schema T .

Discovering c-mappings or correspondences between schemas is a crucial problem in schema mappings and may involve variety of methods and tools [ 23, 7, 8 ]. However, formal methods for representing mappings based on well established formal languages enable inferring new mappings in a repository of accepted mappings [ 17, 11, 19, 9, 20 ]. In this way user attention [15] can be utilized. Example 3. A c-mapping from S1 into S2 has the following speci¯cation:

M12 := ¼1(zT ; zN ; zU ) ^ Á12(zT ; zN ; zU ; zY ) ) ¼2(zT ; zN ; zU ; zY ); where ¼1(zT ; zN ; zU ) and ¼2(zT ; zN ; zU ; zY ) are tree-pattern formulas speci¯ed in automappings of S1 and S2, respectively, and Á12(zT ; zY ) is the atom formula zY = fY (zT ) de¯ning target variable zY . ¤ A t-mapping or a transformation is a speci¯cation that describes how data structured under the source schema is to be transformed into data structured under the target schema preserving keys in the target. A t-mapping TST from a source schema S into a target schema T is a SO XSTD formula of the form ¼S(x) ^ ÃS;T (x; y) ) ¢T (x; y); (8) and can be automatically derived from the c-mapping MST from S to T and the automappings AT over T . Then the following rule is used:

MST = ¼S(x1) ^ ÁST (x1; y1) ) ¼T (x1; y1)

AT = ¼T (x2) ^ ÃT (x2; y2) ) ¢T (x2; y2)

TST = (¼S(x1) ^ ÁST (x1; y1))[(x1; y1) 7! x2] ^ ÃT (x2; y2) ) ¢T (x2; y2) The formula (¼S(x1)^ÁST (x1; y1))[(x1; y1) 7! x2] arises from ¼S(x1)^ÁST (x1; y1) by replacing variables in (x1; y1) with corresponding variables in x2. The replacement is made according to occurrences of variables within ¼T (x1; y1) and ¼T (x2).

Example 4. From the c-mapping M12 (Example 3) and the automapping A2 (Example 2), the following t-mapping T12 from S1 into S2 can be obtained: T12 := ¼1(yT ; yN ; yU ) ^ Á12(yT ; yN ; yU ; yY ) ^ Ã2(yT ; yN ; yU ; yY )

) ¢2(yT ; yN ; yU ; yY ) 5

Creation of key-preserving target instances In this section we propose an algorithm generating a target instance for a given t-mapping. The left-hand side of the t-mapping provides a set of variable valuations, and the right-hand side consists of a set of key-pattern formulas. Each key-pattern formula is used to create a Skolem term that for valuations in delivers nodes of the expected target XML tree. We will show that the created instance satis¯es keys represented by key-pattern formulas.

Let ± be a key-pattern formula with variables in x. If for each valuation ! over ± into I the set [[±(!)]] has exactly one element, then a key represented by ± is satis¯ed in I. If ±0 is satis¯ed in I and ± is a key-pattern formula of the form ±0=l[E] then, using Lemma 1, we can recursively analyze whether ± represents a key satis¯ed in I or not.

The following theorem formulates a necessary condition for satisfaction of a key in an XML tree I. We will show that if the condition is satis¯ed in I by a key-pattern formula ±, then the key represented by ± is satis¯ed in I. That observation will be the base of our implementation (Algorithm 2). Theorem 1. Let ± = ±0=l[E] be a key-pattern formula and I be an XML tree: { ±0 = =top=l1[E1]=:::=lm[Em]; { E = (P1 = x1 ^ ¢ ¢ ¢ ^ Pk = xk); { { a set of all variable valuations over ± into text values in I. If for any valuation ! 2 count([[±0=l(!)]]) = countf! 2 j I j= ±(!)g (9) then the key represented by ± is satis¯ed in I, i.e.

I j= (=top=l1=:::=lm; (l; (P1; :::; Pk))): Proof. (Sketch) The key represented by ±0 is satis¯ed in I if for each valuation ! 2 , a set [[±0(!)]] has exactly one element. By structural induction we will proof that for any ! 2 the equality (9) implies count([[±(!)]]) = 1. Indeed: 1. The thesis holds for ± = =top=l[P1 = x1 ^ ¢ ¢ ¢ ^ Pk = xk], since: { count([[=top(!)]]) = 1, for any ! 2 , { if the number of nodes of type l determined by =top=l is equal to the number of di®erent valuations of variables (x1; :::; xk), then the key (=top; (l; (P1; :::; Pk))) represented by ± is satis¯ed in I. 2. Let ± = ±0=l[P1 = x1 ^ ¢ ¢ ¢ ^ Pk = xk]. Then from the induction hypothesis count([[±0(!)]]) = 1, for each ! 2 . Now it is easy to see that the equality (9) implies that count([[±(!)]]) = 1, for each ! 2 ; hence the key represented by ± is satis¯ed in I. ¤ The theorem states that if the number of nodes in [[±0=l]] is equal to the number of di®erent valuations of variables in ± satisfying I, then the key represented by ± is satis¯ed in I.

This observation is the basis of our algorithm for generation of key-preserving instances in data exchange. The one-to-one (bijective) correspondence between sets mentioned in the thesis of the theorem, equality (9), can be achieved by means of node-valued Skolem functions. The list of arguments of a Skolem function Fl consists of all variables x occurring in ±=key-patt(KeyDep(l)). Then every valuation ! 2 creates a unique element node Fl(!(x)) of type l. Similarly for text nodes.

Now, we shall de¯ne the semantics for t-mappings. The left-hand side: ¼S (x) ^ ÃS;T (x; y) of a t-mapping (8), determines a totally ordered set of variable valuations.

The ordering of valuations in is induced by the lexicographic ordering of tuples of text nodes whose values are assigned to the vector x of variables. By ord(!) we denote the ordering position of ! in . To order nodes in a result target tree, we will use the Dewey order encoding [24], where each node n is assigned a sequence P os(n) that represents position of n in the tree. For example, P os(n) = 1:3:2 says that n is the second child of the third child of the top element.

We assume: { P os(r) = 0, if r is the root; P os(n) = 1, if n is the top element; { P os(n) = P os(n0):ord(!), where n0 is the parent of n and n0 is obtained under a valuation ! 2 .

Let be a set of valuations, and ¢ = f±1; :::; ±pg be a set of key-pattern formulas occurring in the right-hand side of a t-mapping (8).

Then an XML tree J = semT (¢)(); where J = (r; N e; N t; ·; child; ¸; º; ¾); is created as follows:

semT (¢)() = fsemT (±)(!) j ± 2 ¢; ! 2 g;

Conclusions

We proposed a new approach to use XML schema mappings in data exchange. To this aim we introduced a new class of tree-pattern formulas called keypattern formulas that provide a natural way to express contextual constraints in XML schema mappings and are used to perform key-preserving mappings. We distinguished among three classes of mappings: automappings (transformations of a schema onto itself), c-mappings (correspondences) and t-mappings (key-preserving transformations). The distinction is justi¯ed by di®erent ways in which they are obtained: automatically from a schema speci¯cation (automappings), quasi-manually (c-mappings), or by deriving from another mappings (tmappings). In our formalism, mappings capture three kinds of contextual constraints: keys, key references and value dependencies. We demonstrated bene¯ts of this formalism including increased speci¯cation accuracy, and the ability to manage mappings and reasoning on them. In the future we plan to consider operation on key-preserving mappings, as well as on mappings capturing another constraints. For this purpose we need reasoning procedures on tree-patterns and ontological knowledge describing the semantics of XML schemas which may be applied for ¯nding correspondences between di®erent schemas [ 6 ].

1. Abiteboul , S. , Hull , R. , Vianu , V. : Foundations of Databases , Addison-Wesley, Reading, Massachusetts, 1995 .

2. Amer-Yahia , S. , Cho , S. , Lakshmanan , L. V. S. , Srivastava , D. : Tree pattern query minimization ., VLDB Journal , 11 ( 4 ), 2002 , 315 { 331 .

3. Arenas , M. , Libkin , L. : XML Data Exchange: Consistency and Query Answering , PODS Conference, 2005 , 13 { 24 .

4. Buneman , P. , Davidson , S. B. , Fan , W. , Hara , C. S. , Tan , W. C. : Reasoning about keys for XML, Information Systems , 28 ( 8 ), 2003 , 1037 { 1063 .

5. Calvanese , D. , Giacomo , G. D. , Lenzerini , M. , Rosati , R.: Logical Foundations of Peer-To-Peer Data Integration. , Proc. of the 23rd ACM SIGMOD Symposium on Principles of Database Systems (PODS 2004 ), 2004 , 241 { 251 .

6. Cybulka , J. , Meissner , A. , Pankowski , T. : Schema- and Ontology-Based XML Data Exchange in Semantic E-Business

Applications

, Business Information Systems, BIS 2006, Lecture Notes in Informatics , Vol. 85 , 2006 , 429 { 441 .

7. Doan , A. , Domingos , P. , Halevy , A. : Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach , ACM SIGMOD 2001 , ACM, 2001 , 509 { 520 .

8. Doan , A. , Noy , N. F. , Halevy , A. Y. : Introduction to the Special Issue on Semantic Integration ., SIGMOD Record , 33 ( 4 ), 2004 , 11 { 13 .

9. Fagin , R.: Inverting schema mappings ., PODS Conference, 2006 , 50 { 59 .

10. Fagin , R. , Kolaitis , P. G. , Miller , R. J. , Popa , L. : Data Exchange: Semantics and Query Answering ., ICDT 2003 , Lecture Notes in Computer Science 2572 , Springer, 2002 , 207 { 224 .

11. Fagin , R. , Kolaitis , P. G. , Popa , L. : Data exchange: getting to the core ., ACM Trans. Database Syst ., 30 ( 1 ), 2005 , 174 { 210 .