An Algorithm for Transforming XPath Expressions
                    According to Schema Evolution

                Kazuma Hasegawa                                 Kosetsu Ikeda                 Nobutaka Suzuki
             Graduate School of Library,                 Graduate School of Library,     Faculty of Library, Information
           Information and Media Studies               Information and Media Studies          and Media Science
                University of Tsukuba                       University of Tsukuba           University of Tsukuba
            s1221595@u.tsukuba.ac.jp                     lumely@slis.tsukuba.ac.jp        nsuzuki@slis.tsukuba.ac.jp

ABSTRACT
XML is a de-fact standard format on the Web. In general,
schemas of XML documents are continuously updated ac-
cording to changes in real world. If a schema is updated,
then query expressions have to be transformed so that they
are “valid” under the updated schema, since the expressions
are no longer valid under the updated schema due to the
schema update. However, this is not an easy task since
many of recent schemas are large and complex and thus it is
becoming diﬃcult to know how to update the query expres-
sions correctly. In this paper, we propose an algorithm for
transforming XPath expressions according to schema evolu-
tion. For an XPath expression p and a schema S, our algo-
rithm treats both p and S as tree automata T Ap and T AS ,
respectively. Our algorithm first takes the product automa-
ton T AR of T Ap and T AS , then analyze T AR to find the
correspondence between the states of T Ap and T AS . Based
on this correspondence, the algorithm transforms T Ap ac-
cording to an update operation applied to T AS . We also
show some preliminary experimental results.

Categories and Subject Descriptors
H.2.1 [Database Management]: Logical Design—Data
models

General Terms                                                            Figure 1: An example of XPath transformations
Algorithms

Keywords                                                                real world, which is called schema evolution. If a schema is
                                                                        updated, then query expressions have to be transformed so
XML, XPath, schema evolution, tree automaton                            that they are “valid” under the updated schema, since the
                                                                        expressions are no longer valid against the updated schema
1. INTRODUCTION                                                         due to the schema update. However, this is not an easy task
  XML[5] is a de-fact standard format on the Web. An XML                since many of recent schemas are large and complex, and
document is usually stored with its schema so that the struc-           thus it is becoming very diﬃcult to know how to update the
tural consistency of the document is ensured. In general,               query expressions correctly.
schemas are continuously updated according to changes in                   In this paper, we propose an algorithm for transforming
                                                                        XPath expressions according to schema evolution. Here,
                                                                        XPath[4] is the most popular query language for XML. For
                                                                        a given schema S, an edit operation op to S, and an XPath
                                                                        expression p under S, our algorithm transforms p into an
                                                                        XPath expression p′ “equivalent” to p whenever possible,
This work is licensed under the Creative Commons Attribution-           that is, the result of p′ under op(S) coincides with that of
ShareAlike 3.0 Unported License (CC BY-SA 3.0). To view a copy          p under S, where op(S) is the updated schema obtained by
of the license, visit http://creativecommons.org/licenses/by-sa/3.0/.   applying op to S.
DChanges 2013, September 10th, 2013, Florence, Italy.                      To illustrate our algorithm, let us show a simple example.
ceur-ws.org Volume 1008, http://ceur-ws.org/Vol-1008/paper4.pdf .       Let D be the DTD in Fig. 1(a), and suppose that element
students is inserted as the child of school (Fig. 1(b)). Ac-       An XML document is modeled as a labeled ordered tree,
cording to this schema update, our algorithm transforms the     and a schema is modeled as a tree automaton. Formally, a
following XPath expression                                      tree automaton is a quadruple T A = (N, Σ, s, P ), where

/school/student[supervisor]/name                                     • N is a set of states (element types),
                                                                     • Σ is a finite set of element names,
into the following:
                                                                     • s ∈ N is the start state,
/school/students/student[supervisor]/name.
                                                                     • P is a set of transition rules of the form X → a(reg)
Then, suppose that element supervisor is deleted from the              or X → Y , where X, Y ∈ N and reg is a regular
DTD in Fig. 1(b). In this case, it is impossible to transform          expression over N .
the above XPath expression into an equivalent one. Thus,
                                                                For a transition rule X → a(reg), we say that X is the left-
as an alternative answer our algorithm deletes supervisor
                                                                hand side, a(reg) is the right-hand side, a is the label, and
from the above expression and returns the following:
                                                                reg is the content model of the rule. For example, consider
/school/students/student/name.                                  the tree automaton T Aa shown in Fig. 2. This is equiva-
                                                                lent to the DTD in Fig. 1(a). “School” in N is the type of
Although these DTDs are very small, schemas used in prac-       element “school” , and so on. We assume that element “pc-
tice are much larger and complex [6]. Therefore, a user tends   data” represents an arbitrary string. By L(T A) we mean the
not to understand the entire structure of a schema exactly,     language of tree automaton T A, i.e., the set of trees “valid”
and thus our algorithm is helpful to transform XPath ex-        against T A.
pressions appropriately according to schema evolutions.            Following [15], we define the product of tree automata. Let
   In this paper, schema is modeled as (unranked) tree au-      TA1 = (N1 , Σ1 , s1 , P1 ) and TA2 = (N2 , Σ2 , s2 , P2 ) be tree
tomaton, which is equivalent to regular tree grammar. Tree      automata. Without loss of generality, we assume that Σ1 =
automaton is the formal model of RELAX NG, and XML              Σ2 = Σ. First, we define the product reg 1 ⊕ reg 2 , where reg 1
Schema and DTD can also be modeled by tree automa-              is a regular expression over N1 and reg 2 is a regular expres-
ton [14]. Moreover, tree automaton is equivalent to special-    sion over N2 . Then reg 1 ⊕ reg 2 is a regular expression over
ized DTD [17]. Thus, our algorithm is applicable to most of     N1 × N2 such that n11 n21 · · · ni1 matches reg 1 and n12 n22 · · · ni2
formal and practical XML schema languages. As for XPath,        matches reg 2 if and only if [n11 , n12 ][n21 , n22 ] · · · [ni1 , ni2 ] matches
we focus on an XPath fragment using child and descendant-       reg 1 ⊕ reg 2 , where n11 n21 · · · ni1 is a sequence of states in N1
or-self axes with predicates. Although our XPath fragment       and n12 n22 · · · ni2 is a sequence of states in N2 . reg 1 ⊕ reg 2 is
supports no upward axes, this gives usually little problem      constructed as follows.
since the majority of XPath queries uses only downward             1. reg ′1 is obtained from reg 1 by replacing each state n
axes[10]. Thus, we believe that our algorithm is useful to            by [n1 , n12 ]|[n1 , n22 ]| · · · |[n1 , nk2 ], where n12 , n22 , · · · , nk2
correct a large number of XPath queries.                              is an enumeration of N2 . Similarly, reg ′2 is obtained
                                                                      from reg 2 .
Related Work
The study most related to this paper is [13]. This study           2. Construct two automata A1 and A2 from reg ′1 and
proposes an algorithm for transforming XPath expressions              reg ′2 , respectively.
according to schema evolution, assuming that a schema              3. Construct the product automaton of A1 and A2 .
monotonically increases (no element can be deleted from a
schema). On the other hand, this paper has no such an as-          4. Construct a regular expression, reg 1 ⊕ reg 2 , from the
sumption and allows more general schema updates. Since                product automaton.
actual schema evolutions usually involve element deletions       The product automaton of TA1 and TA2 is TA3 = (N1 ×
or some similar updates, our algorithm is more practical in     N2 , Σ, s1 × s2 , P3 ), where
real world situations. To the best of our knowledge, there
is no study on transforming XPath expressions according to      P3     = {[n1 , n2 ] → a(reg1 ⊕ reg2 ) | (n1 → a(reg1 ) ∈ P1 ,
schema evolution, except [13]. However, several studies deal                n2 → a(reg2 ) ∈ P2 )
with update operations to schemas. For example, Ref. [18]                   ∪{[n1 , n2 ] → [n1 , n′2 ] | (n1 ∈ N1 , n2 → n′2 ∈ P2 )}
proposes a “complete” set of update operations to DTDs.
Refs. [12, 9] propose update operations schemas and algo-                   ∪{[n1 , n2 ] → [n′1 , n2 ] | (n1 → n′1 ∈ P1 , n2 ∈ N2 )}}.
rithms for extracting “diﬀ” between two schemas. Refs. [7,      By definition, it is immediate that for any tree t, t ∈ L(T A3 )
8, 19] propose update operations that assures any updated       if and only if t ∈ L(T A1 ) and t ∈ L(T A2 ).
schema contains its original schema. Recently, Ref. [16]           Let Σ be a set of element names. We define XPath ex-
introduces a taxonomy of possible problems induced by a         pression p as follows.
schema change, and gives an algorithm to detect such prob-
lems. Ref. [11] studies query-update independence analysis,          • p ::= /p′
and shows that the performance of [3] can be drastically             • p′ ::= χ :: l | p′ /p′ | p′ [q], where l ∈ Σ
enhanced in the use of µ-calculus.
                                                                     • χ ::=child | descendant-or-self

2. DEFINITIONS                                                       • q ::= p′
   In this section, we define tree automaton and the product    We call each χ ∈ {child, descendant-or-self} axis, and q pred-
of tree automata.                                               icate, respectively.
        N   =   {School, Student, ID, Name, Address, Supervisor, Pcdata},
        Σ   =   {school, student, id, name, address, supervisor, pcdata},
        P   =   {School → school(Student∗ ), Student → student(ID Name Address Supervisor?), ID → id(Pcdata),
                Name → name(Pcdata), Address → address(Pcdata), Supervisor → supervisor(Pcdata), Pcdata → pcdata(ϵ)}.


                                    Figure 2: Tree automaton T Aa = (N, Σ, School, P )


                                                                          • del opr(A, a, i): Deletes the operator at position i in
                                                                            the content model of r, where r is the transition rule
                                                                            in P such that the left-hand side of r is A and that the
                                                                            label of r is a.
                                                                          • nest state(A, a, B, b, i): Replaces the subexpression E
                                                                            at position i (i.e., subtree rooted at position i) in con-
                                                                            tent model of r by state B, where r is the transition
                                                                            rule in P such that the left-hand side of r is A and
       Figure 3: Tree representation of (A|B)∗                              that the label of r is a. Moreover, a transition rule
                                                                            B → b(E) is added to P .
                                                                          • unnest state(A, a, b, i): This is the inverse operation
3. UPDATE OPERATION TO TREE AU-                                             of nest state, and replaces the state B at position i
   TOMATON                                                                  in the content model of r by reg, where (1) r is the
   In this section, we define update operations to tree au-                 transition rule in P such that the left-hand side of r is
tomaton.                                                                    A and that the label of r is a and (2) reg is the content
   Let reg be a regular expression. To define update oper-                  model of the transition rule whose left-hand side is B
ations to a tree automaton, we need to identify the posi-                   and label is b.
tions of states and operators in reg. Thus we define the set
                                                                          • replace state(A, a, B, i): Replaces the state at position
pos(reg) of positions in a reg, as follows.
                                                                            i in the content model of r by B. In this paper, this
   • If reg = ϵ or reg = a (a ∈ Σ), then pos(reg) = {λ}.                    operation is treated as a pair of unnest state(A, a, i)
                                                                            and nest state(A, a, B, b, i). This operation is used in
   • Otherwise, pos(reg) = {λ} ∪ {u | u = vw, 1 ≤ v ≤                       order to “rename” element names.
     n, w ∈ pos(reg v )}, where reg v is a subexpression con-          For example, consider the tree automaton T Aa in Fig. 2.
     sisting of the descendants of v in reg.                         Applying nest state(School, school, Students, students, λ)
                                                                     to T Aa , we obtain the tree automaton T Ab in
For example, let reg = (A|B)∗ .            Then pos(reg) =           Fig. 4. Then applying del opr(Student, student, 4) and
{λ, 1, 11, 12}. As shown in Fig. 3, λ is the position of ∗,          del state(Student, student, 4) to Tb , we obtain the tree
1 is the position of |, 11 is the position of A, and 12 is the       automaton T Ac in Fig. 5.
position of B.
  In the following, without loss of generality we assume
that for any state A and any element name a, there is at             4.     TRANSFORMATION FROM XPATH
most one transition rule whose left-hand side is A and la-                  EXPRESSION INTO TREE AUTOMA-
bel is a. If there are transition rules A → a(reg1 ) and                    TON
A → a(reg2 ), then these can be merged into one transition
                                                                        Our algorithm treats an XPath expression as a tree au-
rule A → a(reg1 |reg2 ).
                                                                     tomaton. Thus, in this section we define a transformation
  We now define update operations to a tree automaton
                                                                     from an XPath expression into a tree automaton. This
T A = (N, Σ, s, P ) as follows.
                                                                     transformation is based on [15].
   • ins state(A, a, B, i): Inserts new state B at position i           An XPath expression is transformed into a tree automaton
     in the content model of r, where r is the transition rule       as follows. First, we transform a location step χ :: l into a
     in P such that the left-hand side of r is A and that the        tree automaton. Each transformed tree automaton has input
     label of r is a.                                                and output states, corresponding to the “context node” for
                                                                     χ :: l and the “result node” selected by χ :: l, respectively.
   • ins opr(A, a, opr, i): Inserts an operator opr (∗, +, ?)        For an XPath expression p of the form p1 /p2 , p′ [q], or /p′ , we
     at position i in the content model of r, where r is the         transform p into a tree automaton in a bottom-up manner,
     transition rule in P such that the left-hand side of r is       according to the structure of p.
     A and that the label of r is a.                                    We first define a tree automaton (N, Σ, s, P ) correspond-
                                                                     ing to location step χ :: l. In the following, A → σ(· · · )
   • del state(A, a, i): Deletes the state at position i in the      denotes {A → σ(· · · ) | σ ∈ Σ}.
     content model of r, where r is the transition rule in                • If χ = child, then
     P such that the left-hand side of r is A and that the
     label of r is a.                                                          – The initial state s is B.
N   =     {School, Students, Student, ID, Name, Address, Supervisor, Pcdata},
Σ   =     {school, students, student, id, name, address, supervisor, pcdata},
P   =     {School → school(Students), Students → students(Student∗ ), Student → student(ID Name Address Supervisor?),
          ID → id(Pcdata), Name → name(Pcdata), Address → address(Pcdata), Supervisor → supervisor(Pcdata), Pcdata → pcdata(ϵ)}.


                                       Figure 4: Tree automaton T Ab = (N, Σ, School, P )


    N     =    {School, Students, Student, ID, Name, Address, Pcdata},
    Σ     =    {school, students, student, id, name, address, pcdata},
     P    =    {School → school(Students), Students → students(Student∗ ), Student → student(ID Name Address), ID → id(Pcdata),
               Name → name(Pcdata), Address → address(Pcdata), Pcdata → pcdata(ϵ)}.


                                       Figure 5: Tree automaton T Ac = (N, Σ, School, P )


          – P consists of the following transition rules.                The case of p[q]
             ∗ A → σ(A∗ )                                                Let T Ap = (Np , Σ, sp , Pp ) be the tree automaton of p and
             ∗ B → σ(A∗ (B|C)A∗ )                                        T Aq = (Nq , Σ, sq , Pq ) be the tree automaton of q. Moreover,
             ∗ B→C                                                       let T A = (N, Σ, s, P ) be the product automaton T Ap and
             ∗ C → σ(A∗ DA∗ )                                            T Aq . Then the tree automaton T Ap[q] of p[q] is defined as
             ∗ D → l(A∗ )                                                T Ap[q] = (Np[q] , Σ, s, P ), where

    • If χ = descendant-or-self, then                                           Np[q]   =    {[np , nq ] | (np ∈ N Op , nq ∈ N Iq ) ∨
                                                                                             (np ∈ (Np − N Op ), nq ∈ (Nq − N Iq ))}.
          – Initial state s is B1 .
          – P consists of the following transition rules.                  The set N Ip[q] of input states and the set N Op[q] of output
              ∗ A → σ(A∗ )                                               states of T Ap[q] are defined as follows.
              ∗ B1 → σ(A∗ (B1 |C)A∗ )                                         N Ip[q]   = {[np , nq ] | np ∈ N Ip , nq ∈ (Nq − N Iq )},
              ∗ B1 → C                                                        N Op[q]   = {[np , nq ] | (np ∈ N Op , nq ∈ N Iq )}.
              ∗ C → σ(A∗ (B2 |D)A∗ )
              ∗ C→D                                                      The case of /p
              ∗ B2 → σ(A∗ (B2 |D)A∗ )                                    Let χ :: l be the first location step in p and let T Ap =
              ∗ D → l(A∗ )                                               (Np , Σ, sp , Pp ) be the tree automaton of p. Then the tree
                                                                         automaton T A/p of /p is defined as T A/p = (N/p , Σ, R, P/p ),
The input state and the finite set of output states of a finite          where
tree automata are {C} and {D}, respectively. C represents
the context node and D represents the node selected by the                                  N/p   =   Np ∪ {R},
location step χ :: l.                                                                       P/p   =   Pp ∪ {R → root(D)},
   We next define the transformation from an XPath expres-
sion p into a tree automaton. We have to consider the case               where R is the initial state of T A/p and root is the element
where p = p1 /p2 , p = p[q], and p = /p. In the following, for           name corresponding to the root node.
a tree automaton T Ap of p, Np denotes the set of states of                The set N I/p of input states and the set N O/p of output
T Ap , N Ip ⊂ Np denotes the set of input states of T Ap , and           states of T A/p are defined as follows.
N Op ⊂ Np denotes the set of output states of T Ap .                                               N I/p    = {R},
The case of p1 /p2                                                                                N O/p     = N Op ,
Let T Ap1 = (Np1 , Σ, sp1 , Pp1 ) be the tree automaton of p1            where N Op is the set of output states of T Ap .
and T Ap2 = (Np2 , Σ, sp2 , Pp2 ) be the tree automaton of p2 .
Moreover, let T A = (N, Σ, s, P ) be the product automaton
of T Ap1 and T Ap2 . Then the automaton T Ap1 /p2 of p1 /p2
                                                                         5.     ALGORITHM FOR TRANSFORMING
is defined as T Ap1 /p2 = (Np1 /p2 , Σ, s, P ), where                           XPATH EXPRESSION ACCORDING TO
Np1 /p2     = {[np1 , np2 ] | (np1 ∈ N Op1 , np2 ∈ N Ip2 ) ∨                    SCHEMA EVOLUTION
                 (np1 ∈ (Np1 − N Op1 ), np2 ∈ (Np2 − N Ip2 ))}.            In this section, we show an algorithm for transforming a
                                                                         given XPath expression according to schema evolution.
The set N Ip1 /p2 of input states and the set N Op1 /p2 of out-            To describe our algorithm, we need some definitions. For
put states of T Ap1 /p2 are defined as follows.                          an XPath expression p, the selection path of p is the XPath
                                                                         expression obtained by dropping every predicate from p.
N Ip1 /p2     = {[np1 , np2 ] | np1 ∈ N Ip1 , np2 ∈ (Np2 − N Ip2 )},
                                                                         Let p be an XPath expression of the form /p1 [q]/p2 . Then
N Op1 /p2     = {[np1 , np2 ] | np1 ∈ (Np1 −N Op1 ), np2 ∈ N Op2 }.      /p1 /q is called a predicate path of p (the predicate path for
a predicate in q can be defined similarly). For example, Let                    – If all the result elements retrieved by p disappear
p = /a[f ]/b[c/e]/d. Then /a/b/d is the selection path of p,                      from T A due to op, the result of p becomes empty.
and /a/f and /a/b/c/e are the predicate paths of p.                               Then our algorithm returns nil.
   Let us first show the “main” algorithm (Fig. 6). Let op be
                                                                                – Otherwise, some element of op(T A) can still be
the update operation applied to tree automaton T A. More-
                                                                                  retrieved by p. Thus our algorithm transforms p
over, let p be the input XPath expression, p0 be the selection
                                                                                  so that such elements are retrieved.
path of p and p1 , · · · , pk be the predicate paths of p. We
transform pi to p′i according to op, for each i = 0, 1, · · · , k,
and merge the resulting k expressions p′0 , p′1 , · · · , p′k as the      Now          we       present  function    Transform   (Fig.7).
result. More concretely, the algorithm first partition p into          state(A, a, i, P ) in steps 16, 23, and 29 denotes the
p0 , p1 , · · · , pk (step 1). Then p0 is transformed into p′0 ac-     state at position i in the content model of r, where r is
cording to op by function Transform (shown later). If the              the transition rule in P such that its left-hand side is A
result of p0 becomes empty due to del state, then p′0 = nil            and its label is a. If op is ins state, ins opr, or del opr, p
and the transformation of p is terminated (steps 3 and 4).             is unchanged (steps 4 to 7). If op is del state, we examine
Otherwise, we transform pi for each i = 1, · · · , k (steps            whether the result of p becomes empty under op(T A) (steps
6 to 8), by Transform. Finally, p0 , p1 , · · · , pk are merged        8 to 11). If the result of p does not become empty, p is
and returned as the result. If a predicate path p′i is nil,            unchanged (steps 11 to 12). If the result of p becomes
then p′i is just ignored when merged. For example, let                 empty and p is the selection path, the algorithm returns nil
p = /a[f ]/b[c/e]/d. Then p0 = /a/b/c, p1 = /a/f , and                 (steps 13 to 14). Otherwise (i.e., p is a predicate path), we
p2 = /a/b/c/e. Suppose that c is deleted by unnest state.              delete the suﬃx of p that becomes invalid due to op (steps
Then we obtain p′0 = /a/b/d, p′1 = /a/f , and p′2 = /a/b/e             15 to 20). In step 17, P ′′ is obtained in step 2, and we
due to Transform, and merging these three expressions we               say that r is a transition rule from A to C if the left-hand
have p′ = /a[f ]/b[e]/d, which is the result of the algorithm.         side of r contains A and a state in the content model of
   Let us next consider function Transform. Let T A =                  r contains C. In step 18, we say that lsj corresponds to
(N, Σ, s, P ) be a tree automaton, op be an edit operation             r ∈ P ′′ if r is the intersection of (i) the transition rule
to T A, and p be an XPath expression having no predicate.              corresponding to lsj in Pp and (ii) some transition rule in
Our objective is to transform p into an XPath expression p′            P . If op is nest state(A, a, B, b, i), we examine whether we
so that the result of p′ under op(T A) coincides with that of p        need to transform p (steps 23 to 24). If we need to do, we
under T A. However, this is sometimes impossible if a state            insert a new location step child :: b to p (steps 25 and 26). If
(and its corresponding element) is deleted from T A. Thus,             op is unnest state(A, a, b, i), then we also examine whether
Transform is constructed as follows.                                   we need to transform p (steps 30 to 31). Moreover, if the
                                                                       location step lsj+1 to be deleted has a predicate q and the
   • If op is ins state, ins opr, or del opr, then p is un-            axis of lsj+1 is descendant-or-self, we delete q (steps 32 to
     changed.                                                          33). Otherwise, we delete lsj+1 from p (steps 34 to 38).
                                                                          Finally, let us present the time complexity of the al-
   • If op is nest state(A, a, B, b, i), then location step
                                                                       gorithm. Let T A = (N, T, s, P ) be a tree automaton
     child :: b is inserted to p at the position that nest-
                                                                       (schema) p be an XPath expression that is partitioned into
     ing is occurred, if necessary (i.e., unless p contains a
                                                                       p0 , p1 , p2 , · · · , pk . Then the algorithm runs in O(k · |p|2 ·
     descendant-or-self axis that “masks” the nest state op-
                                                                       |T A|), where |p| is the number of location steps in p and
     eration).
                                                                       |T A| is the size of T A.
   • If op is unnest state(A, a, b, i), then a location step
     whose node test is b is deleted from p, if it is not
     “masked” by a descendant-or-self axis.                            6.   EXPERIMENTAL RESULTS
                                                                         To verify if our algorithm transforms XPath expressions
   • If op is del state, then p is modified as follows.                appropriately under real world schemas, we implemented
                                                                       our algorithm (in Ruby) and made a few experimentations.
Input : XPath expression p, tree automaton T A, update opera-          We use two pairs of schemas, MSRMEDOC DTDs (version
tion op to T A.                                                        2.1.1 and 2.2.2)[2] and the NLM Journal Publishing Tag Set
Output : XPath expression or nil
                                                                       Tag Library DTDs (version 2.3 and 3.0)[1].
   1. Partition p into the selection path p0 and the predicate path      First, we give the evaluation of our algorithm on MSRME-
      p 1 , p2 , · · · , p k .
                                                                       DOC DTDs. Let D211 be the version 2.1.1 MSRMEDOC
   2. p′0 ← Transform(p0 , T A, op);                                   DTD and D222 be the version 2.2.2 MSRMEDOC DTD. The
   3. if p′0 = nil then                                                number of elements of D211 is 185 and that of D222 is 205.
   4.    return nil;                                                   Table 1 shows the number of update operations between
                                                                       D211 and D222 , where “others” are attributes insertions (not
   5. else
                                                                       supported by our algorithm).
   6.    for i ← 1 to k do                                               We generate 90 XPath expressions under D211 by XQ-
   7.      p′i ← Transform(pi , T A, op);                              gen[20]. The average size (i.e., number of location steps) of
   8.    end                                                           the XPath expressions is 5, where the minimum size is 4 and
   9. Merge p′0 , p′1 , · · · , p′k into p′ .                          the maximum size is 7. Each XPath expression uses child
                                                                       axes and at most one descendant-or-self axes (78 XPath ex-
  10. return p′ ;
                                                                       pressions use a descendant-or-self axis). There are 5 XPath
                                                                       expressions whose result becomes empty under D222 . Since
                    Figure 6: Main algorithm                           there is no predicate in these 90 XPath expressions, we
                Table 1: Update operations between D211 and D222 and between D23 and D30
                 ins state del state nest state unnest state replace state ins opr/del opr others                         total
    D211 → D222         61        11         27            0             0              60     32                          191
    D23 → D30          504        82          3            0            24               0    120                          733


                                                                       choose 4 expressions and add a predicate to each of 4 ex-
                                                                       pressions (given later).
Function Transform(p, T A, op)                                            We transform the above 90 XPath expressions by our al-
Input : XPath expression p = /ls1 / · · · /lsm , tree automaton        gorithm. For 85 expressions, the elements retrieved under
T A = (N, Σ, s, P ), update operation op to T A.                       D222 coincides with the elements retrieved under D211 . For
Output : XPath expression or nil                                       the rest 5 XPath expressions, our algorithm returns nil since
  1. Construct the tree automaton T Ap = (Np , Σ, sp , Pp ) of p       the results of these expressions become empty under D222 .
  2. Construct the product automaton T A′′ = (ND ×                     These 5 XPath expressions are shown in Table 2 (deleted
     Np , Σ, sD × sp , P ′′ ) of T A and T Ap                          elements are italicized). Since elements LIST, NOTE, FIG-
  3. switch op                                                         URE, and FORMULA (child elements of REMARK) and an
  4. case ins state(A, a, B, i)                                        element PRIVATE-CODES (child element of COMPANY-
  5. case ins opr(A, a, opr, i)                                        DOC-INFO) are deleted, the results of these XPath expres-
  6. case del opr(A, a, i)                                             sions become empty (note that if an element is deleted from
  7. return p;                                                         a content model, then its descendants cannot be retrieved
  8. case del state(A, a, i)                                           either).
  9.   Construct the tree automaton T A′ = (N ′ , Σ, s′ , P ′ ) of        The four XPath expressions with a predicate, denoted
     op(T A)                                                           p1 , p2 , p3 , p4 , are transformed as follows (Table 3).
 10. Construct the product automaton T A′p = (N ′ ×Np , Σ, s′ ×
     sp , Pp′ ) of T A′ and T Ap                                          • Since element REMARK (child element of
 11. if L(T A′p ) ̸= ∅ then                                                 COMPANY-REVISION-INFO) is deleted,  the
 12.      return p;                                                         predicate of p1 is deleted.
 13. else if p is the selection path then
                                                                          • Since element REMARK is deleted, the last location
 14.      return nil;
                                                                            step in the predicate of p2 is deleted.
 15. else
 16.      C ← state(A, a, i, P ) // C is deleted                          • Since element REMARK is deleted, the location
 17.      if P ′′ contains a transition rule r from A to C then             step corresponding to an element REMARK in p3 is
 18.        Let lsj be the location step in p corresponding to r;           deleted, therefore our algorithm return nil.
 19.        p ← /ls1 / · · · /lsj−1 ;
 20.      end                                                             • Since element L-1 is inserted between P and FT and
 21. end                                                                    between P and STD, two location steps L-1 are in-
 22. case nest state(A, a, B, b, i)                                         serted to p4 .
 23. C ← state(A, a, i, P ) // C is nested
 24. if P ′′ contains a transition rule r from A to C then                As shown above, we can say that every XPath expres-
 25.      Let lsj be the location step in p corresponding to r;        sion is transformed appropriately by our algorithm, even if
 26.      p ← /ls1 / · · · /lsj /child :: b/lsj+1 / · · · /lsm ;       del state’s are involved in the schema evolution. The total
 27. end                                                               execution time of the algorithm for the 90 XPath expres-
 28. case unnest state(A, a, b, i)
                                                                       sion and 191 update operations is 9.58 sec, thus it takes an
 29. C ← state(A, a, i, P ) C is unnested
                                                                       average 0.111 sec per one XPath expression.
 30. if P ′′ contains a transition rule r from A to C then
                                                                          We also made a similar experimentation using the NLM
 31.      Let lsj be the location step in p corresponding to r;
                                                                       Journal Publishing Tag Set Tag Library. Let D23 be the
 32.     if the axis of lsj+1 is descendant-or-self and lsj+2 is the
                                                                       version 2.3 The NLM Journal Publishing Tag Set Tag Li-
     first location step of a predicate then                           brary DTD and D30 be the version 3.0 DTD. The number
 33.        p ← /ls1 / · · · /lsj−1 ;                                  of elements of D23 is 211 and that of D30 is 233. Table 1
 34.      else                                                         shows the number of update operations between D23 and
 35.        axis ← the axis of lsj+1 ;                                 D30 .
 36.        l ← the node test of lsj+2 ;                                  We generate 97 XPath expressions under D23 by XQgen.
 37.        p ← /ls1 / · · · /lsj /axis :: l/lsj+3 / · · · /lsm ;      The average size of the XPath expressions is 6. Each XPath
 38.      end                                                          expression uses child axes and at most once descendant-or-
 39. end                                                               self axes, where 95 XPath expressions include descendant-
 40. end                                                               or-self axes. There are 10 XPath expressions whose result
 41. return p;                                                         becomes empty under D30 . There is no XPath expression
                                                                       with a predicate, thus we choose 3 XPath expressions and
                                                                       added a predicate to each expression.
             Figure 7: Function Transform
                                                                          We transform the 97 XPath expressions by our algorithm.
                                                                       For 87 expressions, the elements retrieved under D30 coin-
                                                                       cides with the elements retrieved under D23 . For the rest
10 XPath expressions, our algorithm returns nil since the          [7] G. Guerrini, M. Mesiti, and D. Rossi. Impact of XML
results of these expressions become empty under D30 , which            schema evolution on valid documents. In Proc.
are listed in Table 4. Since elements citation, contract-num,          WIDM, pages 39–44, 2005.
contract-sponsor in element p are deleted, the results of          [8] K. Hashimoto, Y. Ishihara, and T. Fujiwara. A
these XPath expressions become empty. The three XPath                  proposal of update operations for schema evolution in
expressions with a predicate, denoted p5 , p6 , p7 , are trans-        XML databases and their properties on preservation
formed as follows (Table 5).                                           of schema’s expressive power. IEICE Transactions on
                                                                       Information and Systems (Japanese Edition),
   • Since element chem-struct (child element of ack) is               J90-D(4):990–1004, 2007.
     deleted, the predicate of p5 is deleted.
                                                                   [9] K. Horie and N. Suzuki. Extracting diﬀerences
   • Since element chem-struct-wrapper is renamed to                   between regular tree grammars. In Proc. ACM SAC,
     chem-struct-wrap, the predicate of p6 is renamed ac-              pages 859–864, 2013.
     cordingly.                                                   [10] Z. G. Ives, A. Y. Halevy, and D. S. Weld. An XML
                                                                       query engine for network-bound data. The VLDB
   • Since element custom-meta-wrap is renamed to                      Journal, 11(4):380–402, 2002.
     custom-meta-group, the corresponding location step in        [11] M. Junedi, P. Genevès, and N. Layaı̈da. Xml
     p7 is renamed to custom-meta-group.                               query-update independence analysis revisited. In
                                                                       Proceedings of the 2012 ACM symposium on
Again our algorithm seems to work well despite of del state            Document engineering, DocEng ’12, pages 95–98, 2012.
operations. These results suggest that our algorithm can be
                                                                  [12] E. Leonardi, T. T. Hoai, S. S. Bhowmick, and
applied to XPath expressions under real world schema evo-
                                                                       S. Madria. DTD-Diﬀ: a change detection algorithm
lutions. The total execution time of the algorithm for the 97
                                                                       for DTDs. In Proc. DASFAA, pages 817–827, 2006.
XPath expression and 733 update operations is 77.561 sec,
                                                                  [13] T. Morimoto, K. Hashimoto, Y. Ishinara, and
thus it takes an average 0.800 sec per one XPath expression.
                                                                       T. Fujiwara. Translating XPath queries according to
                                                                       XML schema evolution based on the tree-embedding
7. CONCLUSION                                                          relation. In IEICE Technical Report, vol.107, no.131,
  In this paper, we proposed an algorithm for transforming             DE2007-40, pages 109–114, 2007 (in Japanese).
an XPath expression according to schema evolution allow-          [14] M. Murata, D. Lee, M. Mani, and K. Kawaguchi.
ing element deletions. However, this is just an ongoing work           Taxonomy of XML schema languages using formal
and we have a lot to do. First, we would like to extend                language theory. ACM Transactions on Internet
our algorithm so that it can handle more general XPath ex-             Technology, 5:660–704, 2005.
pressions, e.g., supporting sibling and parent axes. Second,      [15] A. Ohno, Y. Ishihara, and T. Fujiwara. Method of
we use only two schema evolutions in our experimentation.              analyzing the correspondence between types defined
Thus we would like to evaluate our algorithm under more                by an XML schema and XPath subexpressions. In
real world schemas. Third, our algorithm supports neither              IPSJ SIG-Report, 2009-MPS-75(20), pages 1–7, 2009
attribute nor entity declaration in schemas. These should              (in Japanese).
be incorporated into our algorithm.                               [16] R. Oliveira, P. Genevès, and N. Layaı̈da. Toward
                                                                       automated schema-directed code revision. In
8. ACKNOWLEDGEMENT                                                     Proceedings of the 2012 ACM symposium on Document
   This work is partly supported by Grants-in-aid for Scien-           engineering, DocEng ’12, pages 103–106, 2012.
tific Research (23500110).                                        [17] Y. Papakonstantinou and V. Vianu. DTD inference for
                                                                       views of XML data. In Proc. ACM PODS, pages
                                                                       35–46, 2000.
9. REFERENCES                                                     [18] H. Su, D. Kramer, L. Chen, K. Claypool, and E. A.
 [1] “Journal Publishing Tag Set” NLM Journal Archiving                Rundensteiner. XEM: managing the evolution of XML
     and Interchange Tag Suite.                                        documents. In Proc. IEEE RIDE, pages 103–110,
     http://dtd.nlm.nih.gov/publishing/.                               2001.
 [2] “MSR Download” MSR Home.                                     [19] N. Suzuki. An edit operation-based approach to the
     http://www.msr-wg.de/medoc/downlo.html.                           inclusion problem for DTDs. In Proc. ACM SAC,
 [3] M. Benedikt and J. Cheney. Destabilizers and                      pages 482–488, 2007.
     independence of xml updates. Proc. VLDB Endow.,              [20] Y. Wu, N. Lele, R. Aroskar, S. Chinnusamy, and
     3(1-2):906–917, 2010.                                             S. Brenes. XQGen: an algebra-based xpath query
 [4] A. Berglund, S. Boag, D. Chamberlin, M. F.                        generator for micro-benchmarking. In Proc. CIKM,
     Fernandez, M. Kay, J. Robie, and J. Simeon, editors.              CIKM ’09, pages 2109–2110, 2009.
     XML Path Language (XPath) 2.0 (Second Edition).
     http://www.w3.org/TR/xpath20/.
 [5] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler,
     and F. Yergeau, editors. Extensible Markup Language
     (XML) 1.0 (Fifth Edition).
     http://www.w3.org/TR/xml/.
 [6] B. Choi. What are real DTDs like? In Proc. WebDB,
     pages 43–48, 2002.
              Table 2: XPath expressions whose results become empty under D222
 /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/LIST/ITEM
 /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/NOTE
 /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/FIGURE/DESC
 /MSRREP/MATCHING-DCIS/MATCHING-DCI/REMARK/FORMULA/FORMULA-CAPTION/LONG-NAME
 //ADMIN-DATA/COMPANY-DOC-INFOS/COMPANY-DOC-INFO/PRIVATE-CODES/PRIVATE-CODE


                        Table 3: XPath expressions with predicate under D211 and D222
XPath expression p1 :   //DOC-REVISION/COMPANY-REVISION-INFOS/COMPANY-REVISION-
                        INFO[REMARK]/COMPANY-REF
Transformed result:     //DOC-REVISION/COMPANY-REVISION-INFOS/COMPANY-REVISION-INFO/COMPANY-REF
XPath expression p2 :   //DOC-REVISION[COMPANY-REVISION-INFOS/COMPANY-REVISION-
                        INFO/REMARK]/MODIFICATIONS/MODEFICATION
Transformed result:     //DOC-REVISION[COMPANY-REVISION-INFOS/COMPANY-REVISION-
                        INFO]/MODIFICATIONS/MODEFICATION
XPath expression p3 :   //DOC-REVISION[COMPANY-REVISION-INFOS/COMPANY-DOC-INFO/PRIVATE-
                        CODES/PRIVATE-CODE]/DOC-REVISIONS/DOC-REVISION/COMPANY-REVISION-
                        INFOS/COMPANY-REVISION-INFO/REMARK
Transformed result:     nil
XPath expression p4 :   //P[FT]/STD
Transformed result:     //P[L-1/FT]/L-1/STD


                      Table 4: XPath expression whose results become empty under D30
                       //table-wrap-group/table-wrap/speech/p/citation/conf-name
                       //table-wrap-group/table-wrap/speech/p/citation/conf-loc
                       //table-wrap-group/table-wrap/speech/p/citation/comment
                       //table-wrap-group/table-wrap/speech/p/citation/article-title
                       //table-wrap-group/table-wrap/speech/p/citation/annotation
                       //list-item/p/contract-num/named-content/ack/p
                       //p/contract-num/named-content/ack/boxed-text/sec
                       //p/contract-sponsor/named-content/verse-group/verse-group/verse-group
                       //statement/p/contract-sponsor/named-content/verse-group/verse-group
                       //statement/p/contract-sponsor/named-content/verse-group/verse-line


                        Table 5: XPath expressions with predicate under D23 and D30
XPath expression p5 :   //list-item/p/contract-num/named-content/ack[chem-struct]/p
Transformed result:     //list-item/p/contract-num/named-content/ack/p
XPath expression p6 :   //license/p/preformat/named-content/supplementary-material[chem-struct-wrapper]/disp-
                        quote
Transformed result:     //license/p/preformat/named-content/supplementary-material[chem-struct-wrap]/disp-quote
XPath expression p7 :   /article/front/journal-meta/custom-meta-wrap[custom-meta/meta-name]/custom-meta/meta-
                        value
Transformed result:     /article/front/journal-meta/custom-meta-group[custom-meta/meta-name]/custom-meta/meta-
                        value