An Algorithm for Transforming XPath Expressions According to Schema Evolution Kazuma Hasegawa Kosetsu Ikeda Nobutaka Suzuki Graduate School of Library, Graduate School of Library, Faculty of Library, Information Information and Media Studies Information and Media Studies and Media Science University of Tsukuba University of Tsukuba University of Tsukuba s1221595@u.tsukuba.ac.jp lumely@slis.tsukuba.ac.jp nsuzuki@slis.tsukuba.ac.jp ABSTRACT XML is a de-fact standard format on the Web. In general, schemas of XML documents are continuously updated ac- cording to changes in real world. If a schema is updated, then query expressions have to be transformed so that they are “valid” under the updated schema, since the expressions are no longer valid under the updated schema due to the schema update. However, this is not an easy task since many of recent schemas are large and complex and thus it is becoming difficult to know how to update the query expres- sions correctly. In this paper, we propose an algorithm for transforming XPath expressions according to schema evolu- tion. For an XPath expression p and a schema S, our algo- rithm treats both p and S as tree automata T Ap and T AS , respectively. Our algorithm first takes the product automa- ton T AR of T Ap and T AS , then analyze T AR to find the correspondence between the states of T Ap and T AS . Based on this correspondence, the algorithm transforms T Ap ac- cording to an update operation applied to T AS . We also show some preliminary experimental results. Categories and Subject Descriptors H.2.1 [Database Management]: Logical Design—Data models General Terms Figure 1: An example of XPath transformations Algorithms Keywords real world, which is called schema evolution. If a schema is updated, then query expressions have to be transformed so XML, XPath, schema evolution, tree automaton that they are “valid” under the updated schema, since the expressions are no longer valid against the updated schema 1. INTRODUCTION due to the schema update. However, this is not an easy task XML[5] is a de-fact standard format on the Web. An XML since many of recent schemas are large and complex, and document is usually stored with its schema so that the struc- thus it is becoming very difficult to know how to update the tural consistency of the document is ensured. In general, query expressions correctly. schemas are continuously updated according to changes in In this paper, we propose an algorithm for transforming XPath expressions according to schema evolution. Here, XPath[4] is the most popular query language for XML. For a given schema S, an edit operation op to S, and an XPath expression p under S, our algorithm transforms p into an XPath expression p′ “equivalent” to p whenever possible, This work is licensed under the Creative Commons Attribution- that is, the result of p′ under op(S) coincides with that of ShareAlike 3.0 Unported License (CC BY-SA 3.0). To view a copy p under S, where op(S) is the updated schema obtained by of the license, visit http://creativecommons.org/licenses/by-sa/3.0/. applying op to S. DChanges 2013, September 10th, 2013, Florence, Italy. To illustrate our algorithm, let us show a simple example. ceur-ws.org Volume 1008, http://ceur-ws.org/Vol-1008/paper4.pdf . Let D be the DTD in Fig. 1(a), and suppose that element students is inserted as the child of school (Fig. 1(b)). Ac- An XML document is modeled as a labeled ordered tree, cording to this schema update, our algorithm transforms the and a schema is modeled as a tree automaton. Formally, a following XPath expression tree automaton is a quadruple T A = (N, Σ, s, P ), where /school/student[supervisor]/name • N is a set of states (element types), • Σ is a finite set of element names, into the following: • s ∈ N is the start state, /school/students/student[supervisor]/name. • P is a set of transition rules of the form X → a(reg) Then, suppose that element supervisor is deleted from the or X → Y , where X, Y ∈ N and reg is a regular DTD in Fig. 1(b). In this case, it is impossible to transform expression over N . the above XPath expression into an equivalent one. Thus, For a transition rule X → a(reg), we say that X is the left- as an alternative answer our algorithm deletes supervisor hand side, a(reg) is the right-hand side, a is the label, and from the above expression and returns the following: reg is the content model of the rule. For example, consider /school/students/student/name. the tree automaton T Aa shown in Fig. 2. This is equiva- lent to the DTD in Fig. 1(a). “School” in N is the type of Although these DTDs are very small, schemas used in prac- element “school” , and so on. We assume that element “pc- tice are much larger and complex [6]. Therefore, a user tends data” represents an arbitrary string. By L(T A) we mean the not to understand the entire structure of a schema exactly, language of tree automaton T A, i.e., the set of trees “valid” and thus our algorithm is helpful to transform XPath ex- against T A. pressions appropriately according to schema evolutions. Following [15], we define the product of tree automata. Let In this paper, schema is modeled as (unranked) tree au- TA1 = (N1 , Σ1 , s1 , P1 ) and TA2 = (N2 , Σ2 , s2 , P2 ) be tree tomaton, which is equivalent to regular tree grammar. Tree automata. Without loss of generality, we assume that Σ1 = automaton is the formal model of RELAX NG, and XML Σ2 = Σ. First, we define the product reg 1 ⊕ reg 2 , where reg 1 Schema and DTD can also be modeled by tree automa- is a regular expression over N1 and reg 2 is a regular expres- ton [14]. Moreover, tree automaton is equivalent to special- sion over N2 . Then reg 1 ⊕ reg 2 is a regular expression over ized DTD [17]. Thus, our algorithm is applicable to most of N1 × N2 such that n11 n21 · · · ni1 matches reg 1 and n12 n22 · · · ni2 formal and practical XML schema languages. As for XPath, matches reg 2 if and only if [n11 , n12 ][n21 , n22 ] · · · [ni1 , ni2 ] matches we focus on an XPath fragment using child and descendant- reg 1 ⊕ reg 2 , where n11 n21 · · · ni1 is a sequence of states in N1 or-self axes with predicates. Although our XPath fragment and n12 n22 · · · ni2 is a sequence of states in N2 . reg 1 ⊕ reg 2 is supports no upward axes, this gives usually little problem constructed as follows. since the majority of XPath queries uses only downward 1. reg ′1 is obtained from reg 1 by replacing each state n axes[10]. Thus, we believe that our algorithm is useful to by [n1 , n12 ]|[n1 , n22 ]| · · · |[n1 , nk2 ], where n12 , n22 , · · · , nk2 correct a large number of XPath queries. is an enumeration of N2 . Similarly, reg ′2 is obtained from reg 2 . Related Work The study most related to this paper is [13]. This study 2. Construct two automata A1 and A2 from reg ′1 and proposes an algorithm for transforming XPath expressions reg ′2 , respectively. according to schema evolution, assuming that a schema 3. Construct the product automaton of A1 and A2 . monotonically increases (no element can be deleted from a schema). On the other hand, this paper has no such an as- 4. Construct a regular expression, reg 1 ⊕ reg 2 , from the sumption and allows more general schema updates. Since product automaton. actual schema evolutions usually involve element deletions The product automaton of TA1 and TA2 is TA3 = (N1 × or some similar updates, our algorithm is more practical in N2 , Σ, s1 × s2 , P3 ), where real world situations. To the best of our knowledge, there is no study on transforming XPath expressions according to P3 = {[n1 , n2 ] → a(reg1 ⊕ reg2 ) | (n1 → a(reg1 ) ∈ P1 , schema evolution, except [13]. However, several studies deal n2 → a(reg2 ) ∈ P2 ) with update operations to schemas. For example, Ref. [18] ∪{[n1 , n2 ] → [n1 , n′2 ] | (n1 ∈ N1 , n2 → n′2 ∈ P2 )} proposes a “complete” set of update operations to DTDs. Refs. [12, 9] propose update operations schemas and algo- ∪{[n1 , n2 ] → [n′1 , n2 ] | (n1 → n′1 ∈ P1 , n2 ∈ N2 )}}. rithms for extracting “diff” between two schemas. Refs. [7, By definition, it is immediate that for any tree t, t ∈ L(T A3 ) 8, 19] propose update operations that assures any updated if and only if t ∈ L(T A1 ) and t ∈ L(T A2 ). schema contains its original schema. Recently, Ref. [16] Let Σ be a set of element names. We define XPath ex- introduces a taxonomy of possible problems induced by a pression p as follows. schema change, and gives an algorithm to detect such prob- lems. Ref. [11] studies query-update independence analysis, • p ::= /p′ and shows that the performance of [3] can be drastically • p′ ::= χ :: l | p′ /p′ | p′ [q], where l ∈ Σ enhanced in the use of µ-calculus. • χ ::=child | descendant-or-self 2. DEFINITIONS • q ::= p′ In this section, we define tree automaton and the product We call each χ ∈ {child, descendant-or-self} axis, and q pred- of tree automata. icate, respectively. N = {School, Student, ID, Name, Address, Supervisor, Pcdata}, Σ = {school, student, id, name, address, supervisor, pcdata}, P = {School → school(Student∗ ), Student → student(ID Name Address Supervisor?), ID → id(Pcdata), Name → name(Pcdata), Address → address(Pcdata), Supervisor → supervisor(Pcdata), Pcdata → pcdata(ϵ)}. Figure 2: Tree automaton T Aa = (N, Σ, School, P ) • del opr(A, a, i): Deletes the operator at position i in the content model of r, where r is the transition rule in P such that the left-hand side of r is A and that the label of r is a. • nest state(A, a, B, b, i): Replaces the subexpression E at position i (i.e., subtree rooted at position i) in con- tent model of r by state B, where r is the transition rule in P such that the left-hand side of r is A and Figure 3: Tree representation of (A|B)∗ that the label of r is a. Moreover, a transition rule B → b(E) is added to P . • unnest state(A, a, b, i): This is the inverse operation 3. UPDATE OPERATION TO TREE AU- of nest state, and replaces the state B at position i TOMATON in the content model of r by reg, where (1) r is the In this section, we define update operations to tree au- transition rule in P such that the left-hand side of r is tomaton. A and that the label of r is a and (2) reg is the content Let reg be a regular expression. To define update oper- model of the transition rule whose left-hand side is B ations to a tree automaton, we need to identify the posi- and label is b. tions of states and operators in reg. Thus we define the set • replace state(A, a, B, i): Replaces the state at position pos(reg) of positions in a reg, as follows. i in the content model of r by B. In this paper, this • If reg = ϵ or reg = a (a ∈ Σ), then pos(reg) = {λ}. operation is treated as a pair of unnest state(A, a, i) and nest state(A, a, B, b, i). This operation is used in • Otherwise, pos(reg) = {λ} ∪ {u | u = vw, 1 ≤ v ≤ order to “rename” element names. n, w ∈ pos(reg v )}, where reg v is a subexpression con- For example, consider the tree automaton T Aa in Fig. 2. sisting of the descendants of v in reg. Applying nest state(School, school, Students, students, λ) to T Aa , we obtain the tree automaton T Ab in For example, let reg = (A|B)∗ . Then pos(reg) = Fig. 4. Then applying del opr(Student, student, 4) and {λ, 1, 11, 12}. As shown in Fig. 3, λ is the position of ∗, del state(Student, student, 4) to Tb , we obtain the tree 1 is the position of |, 11 is the position of A, and 12 is the automaton T Ac in Fig. 5. position of B. In the following, without loss of generality we assume that for any state A and any element name a, there is at 4. TRANSFORMATION FROM XPATH most one transition rule whose left-hand side is A and la- EXPRESSION INTO TREE AUTOMA- bel is a. If there are transition rules A → a(reg1 ) and TON A → a(reg2 ), then these can be merged into one transition Our algorithm treats an XPath expression as a tree au- rule A → a(reg1 |reg2 ). tomaton. Thus, in this section we define a transformation We now define update operations to a tree automaton from an XPath expression into a tree automaton. This T A = (N, Σ, s, P ) as follows. transformation is based on [15]. • ins state(A, a, B, i): Inserts new state B at position i An XPath expression is transformed into a tree automaton in the content model of r, where r is the transition rule as follows. First, we transform a location step χ :: l into a in P such that the left-hand side of r is A and that the tree automaton. Each transformed tree automaton has input label of r is a. and output states, corresponding to the “context node” for χ :: l and the “result node” selected by χ :: l, respectively. • ins opr(A, a, opr, i): Inserts an operator opr (∗, +, ?) For an XPath expression p of the form p1 /p2 , p′ [q], or /p′ , we at position i in the content model of r, where r is the transform p into a tree automaton in a bottom-up manner, transition rule in P such that the left-hand side of r is according to the structure of p. A and that the label of r is a. We first define a tree automaton (N, Σ, s, P ) correspond- ing to location step χ :: l. In the following, A → σ(· · · ) • del state(A, a, i): Deletes the state at position i in the denotes {A → σ(· · · ) | σ ∈ Σ}. content model of r, where r is the transition rule in • If χ = child, then P such that the left-hand side of r is A and that the label of r is a. – The initial state s is B. N = {School, Students, Student, ID, Name, Address, Supervisor, Pcdata}, Σ = {school, students, student, id, name, address, supervisor, pcdata}, P = {School → school(Students), Students → students(Student∗ ), Student → student(ID Name Address Supervisor?), ID → id(Pcdata), Name → name(Pcdata), Address → address(Pcdata), Supervisor → supervisor(Pcdata), Pcdata → pcdata(ϵ)}. Figure 4: Tree automaton T Ab = (N, Σ, School, P ) N = {School, Students, Student, ID, Name, Address, Pcdata}, Σ = {school, students, student, id, name, address, pcdata}, P = {School → school(Students), Students → students(Student∗ ), Student → student(ID Name Address), ID → id(Pcdata), Name → name(Pcdata), Address → address(Pcdata), Pcdata → pcdata(ϵ)}. Figure 5: Tree automaton T Ac = (N, Σ, School, P ) – P consists of the following transition rules. The case of p[q] ∗ A → σ(A∗ ) Let T Ap = (Np , Σ, sp , Pp ) be the tree automaton of p and ∗ B → σ(A∗ (B|C)A∗ ) T Aq = (Nq , Σ, sq , Pq ) be the tree automaton of q. Moreover, ∗ B→C let T A = (N, Σ, s, P ) be the product automaton T Ap and ∗ C → σ(A∗ DA∗ ) T Aq . Then the tree automaton T Ap[q] of p[q] is defined as ∗ D → l(A∗ ) T Ap[q] = (Np[q] , Σ, s, P ), where • If χ = descendant-or-self, then Np[q] = {[np , nq ] | (np ∈ N Op , nq ∈ N Iq ) ∨ (np ∈ (Np − N Op ), nq ∈ (Nq − N Iq ))}. – Initial state s is B1 . – P consists of the following transition rules. The set N Ip[q] of input states and the set N Op[q] of output ∗ A → σ(A∗ ) states of T Ap[q] are defined as follows. ∗ B1 → σ(A∗ (B1 |C)A∗ ) N Ip[q] = {[np , nq ] | np ∈ N Ip , nq ∈ (Nq − N Iq )}, ∗ B1 → C N Op[q] = {[np , nq ] | (np ∈ N Op , nq ∈ N Iq )}. ∗ C → σ(A∗ (B2 |D)A∗ ) ∗ C→D The case of /p ∗ B2 → σ(A∗ (B2 |D)A∗ ) Let χ :: l be the first location step in p and let T Ap = ∗ D → l(A∗ ) (Np , Σ, sp , Pp ) be the tree automaton of p. Then the tree automaton T A/p of /p is defined as T A/p = (N/p , Σ, R, P/p ), The input state and the finite set of output states of a finite where tree automata are {C} and {D}, respectively. C represents the context node and D represents the node selected by the N/p = Np ∪ {R}, location step χ :: l. P/p = Pp ∪ {R → root(D)}, We next define the transformation from an XPath expres- sion p into a tree automaton. We have to consider the case where R is the initial state of T A/p and root is the element where p = p1 /p2 , p = p[q], and p = /p. In the following, for name corresponding to the root node. a tree automaton T Ap of p, Np denotes the set of states of The set N I/p of input states and the set N O/p of output T Ap , N Ip ⊂ Np denotes the set of input states of T Ap , and states of T A/p are defined as follows. N Op ⊂ Np denotes the set of output states of T Ap . N I/p = {R}, The case of p1 /p2 N O/p = N Op , Let T Ap1 = (Np1 , Σ, sp1 , Pp1 ) be the tree automaton of p1 where N Op is the set of output states of T Ap . and T Ap2 = (Np2 , Σ, sp2 , Pp2 ) be the tree automaton of p2 . Moreover, let T A = (N, Σ, s, P ) be the product automaton of T Ap1 and T Ap2 . Then the automaton T Ap1 /p2 of p1 /p2 5. ALGORITHM FOR TRANSFORMING is defined as T Ap1 /p2 = (Np1 /p2 , Σ, s, P ), where XPATH EXPRESSION ACCORDING TO Np1 /p2 = {[np1 , np2 ] | (np1 ∈ N Op1 , np2 ∈ N Ip2 ) ∨ SCHEMA EVOLUTION (np1 ∈ (Np1 − N Op1 ), np2 ∈ (Np2 − N Ip2 ))}. In this section, we show an algorithm for transforming a given XPath expression according to schema evolution. The set N Ip1 /p2 of input states and the set N Op1 /p2 of out- To describe our algorithm, we need some definitions. For put states of T Ap1 /p2 are defined as follows. an XPath expression p, the selection path of p is the XPath expression obtained by dropping every predicate from p. N Ip1 /p2 = {[np1 , np2 ] | np1 ∈ N Ip1 , np2 ∈ (Np2 − N Ip2 )}, Let p be an XPath expression of the form /p1 [q]/p2 . Then N Op1 /p2 = {[np1 , np2 ] | np1 ∈ (Np1 −N Op1 ), np2 ∈ N Op2 }. /p1 /q is called a predicate path of p (the predicate path for a predicate in q can be defined similarly). For example, Let – If all the result elements retrieved by p disappear p = /a[f ]/b[c/e]/d. Then /a/b/d is the selection path of p, from T A due to op, the result of p becomes empty. and /a/f and /a/b/c/e are the predicate paths of p. Then our algorithm returns nil. Let us first show the “main” algorithm (Fig. 6). Let op be – Otherwise, some element of op(T A) can still be the update operation applied to tree automaton T A. More- retrieved by p. Thus our algorithm transforms p over, let p be the input XPath expression, p0 be the selection so that such elements are retrieved. path of p and p1 , · · · , pk be the predicate paths of p. We transform pi to p′i according to op, for each i = 0, 1, · · · , k, and merge the resulting k expressions p′0 , p′1 , · · · , p′k as the Now we present function Transform (Fig.7). result. More concretely, the algorithm first partition p into state(A, a, i, P ) in steps 16, 23, and 29 denotes the p0 , p1 , · · · , pk (step 1). Then p0 is transformed into p′0 ac- state at position i in the content model of r, where r is cording to op by function Transform (shown later). If the the transition rule in P such that its left-hand side is A result of p0 becomes empty due to del state, then p′0 = nil and its label is a. If op is ins state, ins opr, or del opr, p and the transformation of p is terminated (steps 3 and 4). is unchanged (steps 4 to 7). If op is del state, we examine Otherwise, we transform pi for each i = 1, · · · , k (steps whether the result of p becomes empty under op(T A) (steps 6 to 8), by Transform. Finally, p0 , p1 , · · · , pk are merged 8 to 11). If the result of p does not become empty, p is and returned as the result. If a predicate path p′i is nil, unchanged (steps 11 to 12). If the result of p becomes then p′i is just ignored when merged. For example, let empty and p is the selection path, the algorithm returns nil p = /a[f ]/b[c/e]/d. Then p0 = /a/b/c, p1 = /a/f , and (steps 13 to 14). Otherwise (i.e., p is a predicate path), we p2 = /a/b/c/e. Suppose that c is deleted by unnest state. delete the suffix of p that becomes invalid due to op (steps Then we obtain p′0 = /a/b/d, p′1 = /a/f , and p′2 = /a/b/e 15 to 20). In step 17, P ′′ is obtained in step 2, and we due to Transform, and merging these three expressions we say that r is a transition rule from A to C if the left-hand have p′ = /a[f ]/b[e]/d, which is the result of the algorithm. side of r contains A and a state in the content model of Let us next consider function Transform. Let T A = r contains C. In step 18, we say that lsj corresponds to (N, Σ, s, P ) be a tree automaton, op be an edit operation r ∈ P ′′ if r is the intersection of (i) the transition rule to T A, and p be an XPath expression having no predicate. corresponding to lsj in Pp and (ii) some transition rule in Our objective is to transform p into an XPath expression p′ P . If op is nest state(A, a, B, b, i), we examine whether we so that the result of p′ under op(T A) coincides with that of p need to transform p (steps 23 to 24). If we need to do, we under T A. However, this is sometimes impossible if a state insert a new location step child :: b to p (steps 25 and 26). If (and its corresponding element) is deleted from T A. Thus, op is unnest state(A, a, b, i), then we also examine whether Transform is constructed as follows. we need to transform p (steps 30 to 31). Moreover, if the location step lsj+1 to be deleted has a predicate q and the • If op is ins state, ins opr, or del opr, then p is un- axis of lsj+1 is descendant-or-self, we delete q (steps 32 to changed. 33). Otherwise, we delete lsj+1 from p (steps 34 to 38). Finally, let us present the time complexity of the al- • If op is nest state(A, a, B, b, i), then location step gorithm. Let T A = (N, T, s, P ) be a tree automaton child :: b is inserted to p at the position that nest- (schema) p be an XPath expression that is partitioned into ing is occurred, if necessary (i.e., unless p contains a p0 , p1 , p2 , · · · , pk . Then the algorithm runs in O(k · |p|2 · descendant-or-self axis that “masks” the nest state op- |T A|), where |p| is the number of location steps in p and eration). |T A| is the size of T A. • If op is unnest state(A, a, b, i), then a location step whose node test is b is deleted from p, if it is not “masked” by a descendant-or-self axis. 6. EXPERIMENTAL RESULTS To verify if our algorithm transforms XPath expressions • If op is del state, then p is modified as follows. appropriately under real world schemas, we implemented our algorithm (in Ruby) and made a few experimentations. Input : XPath expression p, tree automaton T A, update opera- We use two pairs of schemas, MSRMEDOC DTDs (version tion op to T A. 2.1.1 and 2.2.2)[2] and the NLM Journal Publishing Tag Set Output : XPath expression or nil Tag Library DTDs (version 2.3 and 3.0)[1]. 1. Partition p into the selection path p0 and the predicate path First, we give the evaluation of our algorithm on MSRME- p 1 , p2 , · · · , p k . DOC DTDs. Let D211 be the version 2.1.1 MSRMEDOC 2. p′0 ← Transform(p0 , T A, op); DTD and D222 be the version 2.2.2 MSRMEDOC DTD. The 3. if p′0 = nil then number of elements of D211 is 185 and that of D222 is 205. 4. return nil; Table 1 shows the number of update operations between D211 and D222 , where “others” are attributes insertions (not 5. else supported by our algorithm). 6. for i ← 1 to k do We generate 90 XPath expressions under D211 by XQ- 7. p′i ← Transform(pi , T A, op); gen[20]. The average size (i.e., number of location steps) of 8. end the XPath expressions is 5, where the minimum size is 4 and 9. Merge p′0 , p′1 , · · · , p′k into p′ . the maximum size is 7. Each XPath expression uses child axes and at most one descendant-or-self axes (78 XPath ex- 10. return p′ ; pressions use a descendant-or-self axis). There are 5 XPath expressions whose result becomes empty under D222 . Since Figure 6: Main algorithm there is no predicate in these 90 XPath expressions, we Table 1: Update operations between D211 and D222 and between D23 and D30 ins state del state nest state unnest state replace state ins opr/del opr others total D211 → D222 61 11 27 0 0 60 32 191 D23 → D30 504 82 3 0 24 0 120 733 choose 4 expressions and add a predicate to each of 4 ex- pressions (given later). Function Transform(p, T A, op) We transform the above 90 XPath expressions by our al- Input : XPath expression p = /ls1 / · · · /lsm , tree automaton gorithm. For 85 expressions, the elements retrieved under T A = (N, Σ, s, P ), update operation op to T A. D222 coincides with the elements retrieved under D211 . For Output : XPath expression or nil the rest 5 XPath expressions, our algorithm returns nil since 1. Construct the tree automaton T Ap = (Np , Σ, sp , Pp ) of p the results of these expressions become empty under D222 . 2. Construct the product automaton T A′′ = (ND × These 5 XPath expressions are shown in Table 2 (deleted Np , Σ, sD × sp , P ′′ ) of T A and T Ap elements are italicized). Since elements LIST, NOTE, FIG- 3. switch op URE, and FORMULA (child elements of REMARK) and an 4. case ins state(A, a, B, i) element PRIVATE-CODES (child element of COMPANY- 5. case ins opr(A, a, opr, i) DOC-INFO) are deleted, the results of these XPath expres- 6. case del opr(A, a, i) sions become empty (note that if an element is deleted from 7. return p; a content model, then its descendants cannot be retrieved 8. case del state(A, a, i) either). 9. Construct the tree automaton T A′ = (N ′ , Σ, s′ , P ′ ) of The four XPath expressions with a predicate, denoted op(T A) p1 , p2 , p3 , p4 , are transformed as follows (Table 3). 10. Construct the product automaton T A′p = (N ′ ×Np , Σ, s′ × sp , Pp′ ) of T A′ and T Ap • Since element REMARK (child element of 11. if L(T A′p ) ̸= ∅ then COMPANY-REVISION-INFO) is deleted, the 12. return p; predicate of p1 is deleted. 13. else if p is the selection path then • Since element REMARK is deleted, the last location 14. return nil; step in the predicate of p2 is deleted. 15. else 16. C ← state(A, a, i, P ) // C is deleted • Since element REMARK is deleted, the location 17. if P ′′ contains a transition rule r from A to C then step corresponding to an element REMARK in p3 is 18. Let lsj be the location step in p corresponding to r; deleted, therefore our algorithm return nil. 19. p ← /ls1 / · · · /lsj−1 ; 20. end • Since element L-1 is inserted between P and FT and 21. end between P and STD, two location steps L-1 are in- 22. case nest state(A, a, B, b, i) serted to p4 . 23. C ← state(A, a, i, P ) // C is nested 24. if P ′′ contains a transition rule r from A to C then As shown above, we can say that every XPath expres- 25. Let lsj be the location step in p corresponding to r; sion is transformed appropriately by our algorithm, even if 26. p ← /ls1 / · · · /lsj /child :: b/lsj+1 / · · · /lsm ; del state’s are involved in the schema evolution. The total 27. end execution time of the algorithm for the 90 XPath expres- 28. case unnest state(A, a, b, i) sion and 191 update operations is 9.58 sec, thus it takes an 29. C ← state(A, a, i, P ) C is unnested average 0.111 sec per one XPath expression. 30. if P ′′ contains a transition rule r from A to C then We also made a similar experimentation using the NLM 31. Let lsj be the location step in p corresponding to r; Journal Publishing Tag Set Tag Library. Let D23 be the 32. if the axis of lsj+1 is descendant-or-self and lsj+2 is the version 2.3 The NLM Journal Publishing Tag Set Tag Li- first location step of a predicate then brary DTD and D30 be the version 3.0 DTD. The number 33. p ← /ls1 / · · · /lsj−1 ; of elements of D23 is 211 and that of D30 is 233. Table 1 34. else shows the number of update operations between D23 and 35. axis ← the axis of lsj+1 ; D30 . 36. l ← the node test of lsj+2 ; We generate 97 XPath expressions under D23 by XQgen. 37. p ← /ls1 / · · · /lsj /axis :: l/lsj+3 / · · · /lsm ; The average size of the XPath expressions is 6. Each XPath 38. end expression uses child axes and at most once descendant-or- 39. end self axes, where 95 XPath expressions include descendant- 40. end or-self axes. There are 10 XPath expressions whose result 41. return p; becomes empty under D30 . There is no XPath expression with a predicate, thus we choose 3 XPath expressions and added a predicate to each expression. Figure 7: Function Transform We transform the 97 XPath expressions by our algorithm. For 87 expressions, the elements retrieved under D30 coin- cides with the elements retrieved under D23 . For the rest 10 XPath expressions, our algorithm returns nil since the [7] G. Guerrini, M. Mesiti, and D. Rossi. Impact of XML results of these expressions become empty under D30 , which schema evolution on valid documents. In Proc. are listed in Table 4. Since elements citation, contract-num, WIDM, pages 39–44, 2005. contract-sponsor in element p are deleted, the results of [8] K. Hashimoto, Y. Ishihara, and T. Fujiwara. A these XPath expressions become empty. The three XPath proposal of update operations for schema evolution in expressions with a predicate, denoted p5 , p6 , p7 , are trans- XML databases and their properties on preservation formed as follows (Table 5). of schema’s expressive power. IEICE Transactions on Information and Systems (Japanese Edition), • Since element chem-struct (child element of ack) is J90-D(4):990–1004, 2007. deleted, the predicate of p5 is deleted. [9] K. Horie and N. Suzuki. Extracting differences • Since element chem-struct-wrapper is renamed to between regular tree grammars. In Proc. ACM SAC, chem-struct-wrap, the predicate of p6 is renamed ac- pages 859–864, 2013. cordingly. [10] Z. G. Ives, A. Y. Halevy, and D. S. Weld. An XML query engine for network-bound data. The VLDB • Since element custom-meta-wrap is renamed to Journal, 11(4):380–402, 2002. custom-meta-group, the corresponding location step in [11] M. Junedi, P. Genevès, and N. Layaı̈da. Xml p7 is renamed to custom-meta-group. query-update independence analysis revisited. In Proceedings of the 2012 ACM symposium on Again our algorithm seems to work well despite of del state Document engineering, DocEng ’12, pages 95–98, 2012. operations. These results suggest that our algorithm can be [12] E. Leonardi, T. T. Hoai, S. S. Bhowmick, and applied to XPath expressions under real world schema evo- S. Madria. DTD-Diff: a change detection algorithm lutions. The total execution time of the algorithm for the 97 for DTDs. In Proc. DASFAA, pages 817–827, 2006. XPath expression and 733 update operations is 77.561 sec, [13] T. Morimoto, K. Hashimoto, Y. Ishinara, and thus it takes an average 0.800 sec per one XPath expression. T. Fujiwara. Translating XPath queries according to XML schema evolution based on the tree-embedding 7. CONCLUSION relation. In IEICE Technical Report, vol.107, no.131, In this paper, we proposed an algorithm for transforming DE2007-40, pages 109–114, 2007 (in Japanese). an XPath expression according to schema evolution allow- [14] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. ing element deletions. However, this is just an ongoing work Taxonomy of XML schema languages using formal and we have a lot to do. First, we would like to extend language theory. ACM Transactions on Internet our algorithm so that it can handle more general XPath ex- Technology, 5:660–704, 2005. pressions, e.g., supporting sibling and parent axes. Second, [15] A. Ohno, Y. Ishihara, and T. Fujiwara. Method of we use only two schema evolutions in our experimentation. analyzing the correspondence between types defined Thus we would like to evaluate our algorithm under more by an XML schema and XPath subexpressions. In real world schemas. Third, our algorithm supports neither IPSJ SIG-Report, 2009-MPS-75(20), pages 1–7, 2009 attribute nor entity declaration in schemas. These should (in Japanese). be incorporated into our algorithm. [16] R. Oliveira, P. Genevès, and N. Layaı̈da. Toward automated schema-directed code revision. In 8. ACKNOWLEDGEMENT Proceedings of the 2012 ACM symposium on Document This work is partly supported by Grants-in-aid for Scien- engineering, DocEng ’12, pages 103–106, 2012. tific Research (23500110). [17] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In Proc. ACM PODS, pages 35–46, 2000. 9. REFERENCES [18] H. Su, D. Kramer, L. Chen, K. Claypool, and E. A. [1] “Journal Publishing Tag Set” NLM Journal Archiving Rundensteiner. XEM: managing the evolution of XML and Interchange Tag Suite. documents. In Proc. IEEE RIDE, pages 103–110, http://dtd.nlm.nih.gov/publishing/. 2001. [2] “MSR Download” MSR Home. [19] N. Suzuki. An edit operation-based approach to the http://www.msr-wg.de/medoc/downlo.html. inclusion problem for DTDs. In Proc. ACM SAC, [3] M. Benedikt and J. Cheney. Destabilizers and pages 482–488, 2007. independence of xml updates. Proc. VLDB Endow., [20] Y. Wu, N. Lele, R. Aroskar, S. Chinnusamy, and 3(1-2):906–917, 2010. S. Brenes. XQGen: an algebra-based xpath query [4] A. Berglund, S. Boag, D. Chamberlin, M. F. generator for micro-benchmarking. In Proc. CIKM, Fernandez, M. Kay, J. Robie, and J. Simeon, editors. CIKM ’09, pages 2109–2110, 2009. XML Path Language (XPath) 2.0 (Second Edition). http://www.w3.org/TR/xpath20/. [5] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, editors. Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/xml/. [6] B. Choi. What are real DTDs like? In Proc. WebDB, pages 43–48, 2002. Table 2: XPath expressions whose results become empty under D222 /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/LIST/ITEM /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/NOTE /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/FIGURE/DESC /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/FORMULA/FORMULA-CAPTION/LONG-NAME //ADMIN-DATA/COMPANY-DOC-INFOS/COMPANY-DOC-INFO/PRIVATE-CODES/PRIVATE-CODE Table 3: XPath expressions with predicate under D211 and D222 XPath expression p1 : //DOC-REVISION/COMPANY-REVISION-INFOS/COMPANY-REVISION- INFO[REMARK]/COMPANY-REF Transformed result: //DOC-REVISION/COMPANY-REVISION-INFOS/COMPANY-REVISION-INFO/COMPANY-REF XPath expression p2 : //DOC-REVISION[COMPANY-REVISION-INFOS/COMPANY-REVISION- INFO/REMARK]/MODIFICATIONS/MODEFICATION Transformed result: //DOC-REVISION[COMPANY-REVISION-INFOS/COMPANY-REVISION- INFO]/MODIFICATIONS/MODEFICATION XPath expression p3 : //DOC-REVISION[COMPANY-REVISION-INFOS/COMPANY-DOC-INFO/PRIVATE- CODES/PRIVATE-CODE]/DOC-REVISIONS/DOC-REVISION/COMPANY-REVISION- INFOS/COMPANY-REVISION-INFO/REMARK Transformed result: nil XPath expression p4 : //P[FT]/STD Transformed result: //P[L-1/FT]/L-1/STD Table 4: XPath expression whose results become empty under D30 //table-wrap-group/table-wrap/speech/p/citation/conf-name //table-wrap-group/table-wrap/speech/p/citation/conf-loc //table-wrap-group/table-wrap/speech/p/citation/comment //table-wrap-group/table-wrap/speech/p/citation/article-title //table-wrap-group/table-wrap/speech/p/citation/annotation //list-item/p/contract-num/named-content/ack/p //p/contract-num/named-content/ack/boxed-text/sec //p/contract-sponsor/named-content/verse-group/verse-group/verse-group //statement/p/contract-sponsor/named-content/verse-group/verse-group //statement/p/contract-sponsor/named-content/verse-group/verse-line Table 5: XPath expressions with predicate under D23 and D30 XPath expression p5 : //list-item/p/contract-num/named-content/ack[chem-struct]/p Transformed result: //list-item/p/contract-num/named-content/ack/p XPath expression p6 : //license/p/preformat/named-content/supplementary-material[chem-struct-wrapper]/disp- quote Transformed result: //license/p/preformat/named-content/supplementary-material[chem-struct-wrap]/disp-quote XPath expression p7 : /article/front/journal-meta/custom-meta-wrap[custom-meta/meta-name]/custom-meta/meta- value Transformed result: /article/front/journal-meta/custom-meta-group[custom-meta/meta-name]/custom-meta/meta- value