Introduction

Finding K Optimum Edit Scripts between an XML Document and a Regular Tree Grammar

0 Graduate School of Library, Information and Media Studies University of Tsukuba 1-2, Kasuga , Tsukuba, Ibaraki 305-8550 , Japan

2 16

Finding optimum and near optimum edit scripts between an XML document and a schema is essential to correcting invalid XML documents. In this paper, we consider finding K optimum edit scripts between an XML document and a regular tree grammar. We first prove that the problem is NP-hard. We next show a pseudopolynomial-time algorithm for solving the problem.

Introduction

An edit script between an (invalid) XML document and a schema is a sequence of edit operations to transform the document into a valid one against the schema. Thus finding such an edit script, especially an optimum one, is essential to correcting invalid XML documents. On the other hand, we often have to find near optimum edit scripts as well as an optimum one, since an optimum edit script is not necessarily the real “best” solution. In this paper, we consider an algorithm for finding K optimum edit scripts between an XML document and a regular tree grammar.

There are three common tree grammars to describe an XML schema; local tree grammar (corresponding to DTD), single-type tree grammar (corresponding to W3C XML Schema), and regular tree grammar (corresponding to RELAX NG) [ 8 ]. It is shown that single-type tree grammar is more expressive than local tree grammar, and regular tree grammar is more expressive than single-type tree grammar. Therefore, our algorithm is applicable to a local tree grammar and a single-type tree grammar as well. The expressive power of regular tree grammar is equivalent to that of specialized DTD [ 10 ].

An XML document is modeled as a labeled ordered tree, where a text node is labeled by pcdata (e.g., Fig. 1(a)). An edit script is a sequence of edit operations, where each edit operation is either ren (rename the label of a node), add (add a new node), or del (delete a node). Each edit operation is associated with a cost. Let t be a tree, G be a regular tree grammar, s be an edit script, and s(t) be the tree obtained by applying s to t. We say that s is an edit script between t and G if s(t) is valid against G. s1, s2, · · · , sK are called K optimum edit scripts between t and G if for every 1 ≤ k ≤ K sk is an edit script between t and G such that sk has the kth least cost among the edit scripts between t and G. name n3 first n6 last n7 name n9 n4dept address n10 (b)

name n3 first n6 last n7 name n9 n4dept address n10 pcdatan12pcdatan14pcdatan16pcdatan18 pcdatan12pcdatan14pcdatan16pcdatan18 (c) n6F C n12 n9N2 C n16

D n4

A n10

C n18

We often have to correct invalid XML documents if we manage a document database, since (i) documents and the schema on a database are changed over time and (ii) documents migrated into the database are not necessarily valid against the schema. Unfortunately, such documents and a schema may be much complicated, and the amount of documents you need to correct is often very large. Therefore, finding manually a correct transformation for each invalid document is surely impractical. On the other hand, our algorithm can present K optimum edit scripts between an XML document and a regular tree grammar, which are reasonable solutions to correcting the document since the real “best” solution is likely to be an optimum or near optimum edit script between the document and the grammar.

Finding an edit script between data has originally been studied in terms of strings [ 7, 11 ]. Then there have been studies on finding the minimum edit distance between a string and a language (e.g., [ 13 ]). However, these studies consider only strings and deal with neither a tree nor a tree grammar. So far many algorithms that find an optimum edit script between two ordered trees or XML documents have been proposed (e.g., [ 4, 5, 9, 14 ]). However, these algorithms consider neither schema nor finding K optimum edit scripts. The algorithm in [ 1 ] measures the distance between an unordered XML document (unordered tree) and a DTD, and the algorithm in [ 3 ] finds the minimum edit distance between an XML document and a regular hedge grammar. However, these algorithms only measures a distance and do not present any actual edit script. The author proposes an algorithm for finding an optimum edit script between an XML document and a DTD [ 12 ], but DTD is strictly less expressive than regular tree grammar and the algorithm finds only one edit script rather than K optimum edit scripts. Since an optimum edit script is not always the best solution, finding K optimum edit scripts is a more practically important problem. For example, let t be the tree in Fig. 2(a) and D be the DTD in Fig. 2(b). Assume that the costs of ren, add, del are 1, 2, 3, respectively. Then the optimum edit script between t and D is to rename n7 to i, the second optimum one is to add a new node labeled by ul between n3 and n7, and the third one is to delete n3 or n7. It is very likely that the second or the third one is the best solution.

In this paper, we first prove that finding K optimum edit scripts between an XML document and a regular tree grammar is NP-hard, then show a pseudopolynomial-time algorithm for solving the problem. An XML document is modeled as a labeled ordered tree. For simplicity, attributes are not considered. In what follows, we use the term tree when we mean labeled ordered tree. A leaf node represents a text node and an internal node represents an element. Each node n is labeled by a symbol called terminal, denoted λ(n). For example, λ(n1) = staff and λ(n3) = name in Fig. 1(a). We assume that for any leaf node n, λ(n) = pcdata.

For the simplicity of the algorithm, we denote each node of a tree t as follows (Fig. 1(a) is an example).

1. The root node of t is denoted n1. 2. Let ni be a node in t, and let nj be the node next to ni in breath-first order. If ni and nj are siblings, then j = i + 1, otherwise j = i + 2.

No node in t is renumbered even if a new node is added to t or a node in t is deleted. By tni we mean the subtree of t rooted at node ni. Similarly, for consecutive siblings nh, · · · , nj in t, by tnh,nj we mean the subforest of t rooted at nh, · · · , nj (Fig. 1(b) is an example). We say that the topmost nodes n3, n4 in Fig. 1(b) are the roots of tn3,n4 . By ch(t, ni), we mean the sequence of the children of a node ni in a forest t (e.g., ch(t, n1) = n3, n4 in Fig. 1(a)).

A regular tree grammar is a 4-tuple G = (N, Σ, S, P ), where N is a set of non-terminals, Σ is a set of terminals, S ⊆ N is a set of start symbols, and P is a set of productions of the form X → a(r) such that X ∈ N , a ∈ Σ, and that r is a regular expression over N . For a production X → a(r), we say that X is the left-hand side, a is the right-hand side, and r is the content model of the production. Let t be a forest and ν be a mapping from every node n in t to a non-terminal ν(n) ∈ N . For a sequence nh, · · · , nj of nodes, we define that ν(nh, · · · , nj ) = ν(nh) · · · ν(nj ). We say that ν is an interpretation of t against G if (i) ν(n) ∈ S for every root n of t (if t is a forest, then t may have more than one roots), and (ii) for every node ni in t, there is a production X → a(r) ∈ P such that ν(ni) = X, λ(ni) = a, and that ν(ch(t, ni)) ∈ L(r), where L(r) is the language represented by r. A forest t is valid against G if there is an interpretation of t against G.

Example 1. Let G = (N, Σ, S, P ) be a regular tree grammar, where N = {T, N1, N2, F, L, D, A, C}, Σ = {staff , name, first , last , dept , address, pcdata}, n3a n8b n9b pcdata pcdata n17 n19 n4b n5b n 6b pcdatapcdata pcdata n11 n13 n15 add(a, n5 , n6 )

del(n 3)

S = {T }, P = {T → staff (N1D), N1 → name(F L), D → dept (N2A), F → first (C), L → last (C), N2 → name(C), A → address(C), C → pcdata(²)}. Let t be the tree in Fig. 1(a) and ν be the mapping illustrated in Fig. 1(c). Then ν is the interpretation of t against G.

Note that there is no DTD equivalent to G. G states that name has two distinct types; the name element of a staff element must consist of first and last elements and the name element of a dept element must consist only of a text string. Such a constraint cannot be represented by any DTD. 2

We use the following edit operations to transform a tree (Fig. 3 is an example). – ren(ni, a): Changes the label of an internal node ni to a ∈ Σ\{pcdata}. – del(ni): Deletes an internal node ni. We assume that no leaf node is deleted (i.e., the texts in a document are preserved). – add(a, nh, nj ): Adds a new node (denoted nh,j ) labeled by a ∈ Σ\{pcdata} as the parent of siblings nh, · · · , nj . – nil: An auxiliary edit operation stating that there is no valid edit operation to transform a tree.

Let t be a forest. An edit script for t (w.r.t. a set Σ of terminals) is a sequence of edit operations defined by the following (1)–(4). By s(t) we mean the forest obtained by applying each edit operation in an edit script s (from left to right) to t. If an edit script s contains an edit operation op, then we write op ∈ s. 1. ² is an edit script for t of length zero, where ²(t) is identical to t. 2. If s0 is an edit script for t, ni is an internal node in t, del(ni) ∈/ s0, and a ∈ Σ\{pcdata}, then s0ren(ni, a) is an edit script for t. 3. If s0 is an edit script for t, ni is an internal node in t, and del(ni) ∈/ s0, then s0del(ni) is an edit script for t. 4. If s0 is an edit script for t, nh and nj are preserved siblings in s0(t), s0 contains no duplicate edit operation of add(a, nh, nj ) (defined below), and a ∈ Σ\{pcdata}, then s0add(a, nh, nj ) is an edit script for t.

Due to the construction of the algorithm, we have two restrictions in (4). We say that nh and nj are preserved siblings if nh and nj are siblings in t with h ≤ j and nh and nj remain siblings in s0(t). We say that add(a0, nh0 , nj0 ) is a duplicate edit operation of add(a, nh, nj ) if nh0 = nh and nj0 = nj . (The latter restriction can be eliminated without difficulty, but details are omitted because of space limitation.)

Let G = (N, Σ, S, P ) be a regular tree grammar, t be a forest, and s be an edit script for t. We say that s is an edit script between t and G if s(t) is valid against G. Each edit operation o is associated with a cost denoted γ(o), where γ(o) ≥ 0. We assume that for any internal node ni and any a ∈ Σ\{pcdata}, γ(ren(ni, a)) = 0 if λ(ni) = a. We also assume that γ(nil) = ∞. The cost of an edit script s is defined as γ(s) = ∑o2s γ(o). For a positive integer K, if the following condition holds, then s1, s2, · · · , sK are K optimum edit scripts between t and G.

– For every 1 ≤ k ≤ K, sk is a kth optimum edit script between t and G; that is, sk is an edit script between t and G such that sk(t) ∈/ Sk¡1 and that γ(sk) ≤ γ(s) for any edit script s between t and G with s(t) ∈/ Sk¡1, where Sk¡1 = {s1(t), s2(t), · · · , sk¡1(t)} (we assume that if there is no such sk, then sk = nil).

In what follows, for simplicity we assume that the root of an XML document cannot be deleted and no new node can be added as the root. 3

NP-Hardness

In this section, we show that finding K optimum edit scripts between a tree and a regular tree grammar is NP-hard. For a tree t and edit scripts s, s0 for t, s is distinct to s0 if s(t) 6= s0(t). The Kth optimum edit script problem is to decide, for a tree t, a regular tree grammar G, and positive integers B and K, whether there are K or more distinct edit scripts s between t and G such that γ(s) ≤ B. We have the following theorem.

Theorem 1. The Kth optimum edit script problem is NP-hard even if only ren operation is allowed.

Proof(sketch): The theorem can be shown by reducing the Kth largest subset problem to the Kth optimum edit script problem. Details are omitted because of space limitation. 2

Thus, it is unlikely that we can find K optimum edit scripts between t and G in time polynomial in the size of t, the size of G, and log K. However, we show in the next section that the problem is solvable in pseudopolynomial time, i.e., in time polynomial in the size of t, the size of G, and K. 4

Overview of the Algorithm

Before presenting the algorithm formally, we show an overview of the algorithm. For simplicity, in this section we assume that K = 1. Let t be a tree and G be a regular tree grammar. In short, our algorithm works as follows. (a) pcdata pcdata b n4 n8 (b) ε

X,Y

X,Y m2 ε X,Y

X,Y X,Y

X,Y

X,Y m3 ε ε X,Y ε m4 ε m5 X,Y m6 m7 X,Y m8 for two edges mi →X mj and mi →Y mj. 1. Any node in t may have various sequences of children according to edit scripts applied to t. We first gather all such sequences into a single graph, denoted H1(t, N ). 2. For each node ni in t, we find an “optimum” sequence of children of ni by solving a shortest path problem over H1(t, N ). In fact, finding such a sequence leads to an optimum edit script for tni , as explained later. A node in a tree may have various sequences of children according to edit scripts applied to the tree. For example, let t be the tree in Fig. 4(a).

– Let s1 = ². Then ch(s1(t), n1) = n3, n4. – Let s2 = add(a, n3, n3). Then ch(s2(t), n1) = n3,3, n4.

– Let s3 = del(n3). Then ch(s3(t), n1) = n6, n4.

Here, let us represent each node ni (nh,j) by an edge mi¡1 → mi (resp., mh¡1 99K mj). Then a sequence of children is represented by a path, e.g., n3,3, n4 is represented by path m2 99K m3 → m4. For every internal node ni in t with ch(t, ni) = nh, · · · , nj, we also put a pair (mi¡1 →² mh¡1, mj →² mi) of edges. This pair is used to visit the children of ni when ni is deleted (e.g., ch(s3(t), n1) = n6, n4 is represented by a path m2 →² m5 → m6 →² m3 → m4). Let H(t) be a graph consisting of all such edges (Fig. 4(b)). H(t) is “complete”; for any edit script s for t and for any node ni in t with ch(t, ni) = nh, · · · , nj, H(t) contains a path from mh¡1 to mj that represents ch(s(t), ni).

Let G = (N, Σ, S, P ) be a regular tree grammar. Since we have to check if a transformed tree is valid against G, we also have to consider the non-terminal associated with each node. Thus we incorporate non-terminals into H(t). This is achieved as follows (assuming N = {X, Y }): replace each mi¡1 → mi (mh¡1 99K X mj) in H(t) by two edges mi¡1 →X mi and mi¡1 →Y mi (resp., mh¡1 99K mj and mh¡1 9Y9K mj) (Fig. 4(c)). For example, path m2 9X9K m3 →Y m4 represents a sequence n3,3, n4 of children such that X is associated with n3,3 and Y is associated with n4. The obtained graph is denoted H1(t, N ). Our algorithm is based on the following idea: we can find an optimum edit script for tni provided that we have an optimum edit script for every proper subtree of tni . In our algorithm, each such an edit script is associated with a corresponding edge in H1(t, N ). For example, let H1(t, N ) be the graph in Fig. 4(c).

X X Assume that for each edge mi¡1 → mi in H1(t, N ) except m0 → m1, we have a corresponding optimum edit script between tni and (N, Σ, {X}, P ), denoted X X σ(mi¡1 →X mi) (and that we also have σ(mh¡1 99K mj) for each mh¡1 99K mj).1

X Let us consider finding σ(m0 → m1), i.e., an optimum edit script between tn1 and (N, Σ, {X}, P ). For simplicity assume that X → a(r) is the only production X whose left hand side is X. Then since n1 must be labeled by a, σ(m0 → m1) must consist of (i) ren(n1, a) and (ii) an optimum edit script s0 between tn3,n4 and (N, Σ, N, P ) such that the roots of s0(tn3,n4 ) match r. To obtain s0, we use the following crucial property of H1(t, N ): for any path p from m2 to m4 in H1(t, N ), σ(p) is an optimum edit script for tn3,n4 w.r.t. the nodes represented by the path.2 For example, if p = m2 9X9K m3 →Y m4, then σ(p) is an optimum edit script between tn3,n4 and (N, Σ, N, P ), assuming that n3,3, n4 are the children of n1 and are associated with X and Y , respectively. This property and the completeness of H1(t, N ) imply that, in order to find s0, it suffices to find a path p from m2 to m4 such that

1. the non-terminals on p matches r, and that 2. the cost of σ(p) is the smallest of the paths satisfying (1).

Such a path can easily be found by solving a shortest path problem in the “intersection graph” of H1(t, N ) and an NFA representing r. 5

The Algorithm

In this section, we first show some preliminaries, then show the algorithm. 5.1

Preliminary Definitions We first define HK (t, N ) and show some related definitions. We next show some definitions to check if a path on HK (t, N ), representing a sequence of nodes, satisfies a given regular expression. 1 We also assume that for any pair (mi¡1 →² mh¡1, mj →² mi), σ(mi¡1 →² mh¡1) = del(ni) and σ(mj →² mi) = ². 2 If p = e1 · · · en, then σ(p) = σ(e1) · · · σ(en).

(a)

X,Y 1,2 m6 1X,,2Y εu εd X,Y

1,2 m10 1X,,2Y m11 m12 1X,,2Y m13

X,Y 1,2

X,Y X,Y 1,2 1,2

X,Y 1,2 X,Y 1,2 X,Y 1,2 m3 1,2

X,Y εu εd 1X,,2Y m7 m8 1X,,2Y m9 εu εu m4 εu stands for four edges mi →X1 mj, mi →X2 mj, mi →Y1 mj, mi →Y2 mj.

Graph HK (t, N ) Let t be a tree, G = (N, Σ, S, P ) be a regular tree grammar, and K be a positive integer. Let z = max{i | ni is a node in t}. HK (t, N ) is defined as a graph consisting of z + 1 nodes m0, m1, · · · , mz and the following edges (Fig. 5(a,b) is an example with N = {X, Y } and K = 2).

– For every node ni in t, every X ∈ N , and every 1 ≤ k ≤ K, HK (t, N ) has an edge mi¡1 →Xk mi, called node edge (n-edge). The non-terminal X means that ni is associated with X under some mapping ν, i.e., ν(ni) = X. – For every pair (nh, nj) of siblings in t (1 < h ≤ j), every X ∈ N , and every X 1 ≤ k ≤ K, HK (t, N ) has an edge mh¡1 99K mj. This is called add edge k (a-edge) and means adding a new node nh,j under some mapping ν such that ν(nh,j) = X. – For every internal node ni in t with ch(t, ni) = nh, · · · , nj, HK (t, N ) has a ²d mh¡1 and an upward ²-edge (²u-edge) downward ²-edge (²d-edge) mi¡1 →

²u mi. Pair (mi¡1 →²d mh¡1, mj →²u mi) is called ²-pair and represents mj → applying del(ni). nh,j (any ²-edge is skipped).

A path in HK (t, N ) represents a sequence of siblings. Let p = e1e2 · · · en be a path in HK (t, N ). We define l(p) = l(e1)l(e2) · · · l(en), where l(ei) is the label (non-terminal) of edge ei (²d and ²u are treated as empty strings). For example,

Y if p = m5 →²d m10 →X1 m11 →²u m6 99K m7, then l(p) = XY . Let q be a sequence of 2 siblings in a tree, ν be a mapping from every node on q to a non-terminal in N , and p be a path in HK (t, N ). Then p represents q under ν if (1) l(p) = ν(q) and (2) q coincides with a sequence of nodes obtained from p by replacing (i) each

X X n-edge mi¡1 →k mi on p with ni and (ii) each a-edge mh¡1 99K mj on p with k Example 2. Let us consider Fig. 5(b). Consider ch(t, n3) = n6, n7 and suppose X Y that ν(n6) = X and that ν(n7) = Y . Then path m5 →k1 m6 →k2 m7 represents (a) m7 εu ch(t, n3), for any k1, k2 ∈ {1, 2}. Changes made by add and del operations are represented by corresponding add and ²-edges (a path is not changed by any ren operation). For example, suppose that we apply s = add(a, n6, n6)del(n7) to t, then ch(s(t), n3) = n6,6, n13 (Fig. 5(c)). Assuming that ν(n6,6) = X and that

X Y ν(n13) = Y , path m5 9k93K m6 →²d m12 →k4 m13 →²u m7 represents ch(s(t), n3) for any k3, k4 ∈ {1, 2}.

X In order to compute edit script σ(mi¡1 →Xk mi) (σ(mh¡1 99K mj)), We have k to examine σ(e) for every edge e representing a descendant of ni (resp., nh,j). Let nh, nj be siblings in t (h ≤ j). By HK (t, N, h, j) we mean a graph consisting of the paths from mh¡1 to mj in HK (t, N ) (Fig. 6(a)). For an internal node ni with ch(t, ni) = nh, · · · , nj, HK (t, N, h, j) consists of the edges representing the descendants of ni. We also define HK0 (t, N, h, j), which is identical to HK (t, N, h, j) except that the a-edges from mh¡1 to mj are missing (Fig. 6(b)). HK0 (t, N, h, j) consists of the edges representing the descendants of nh,j. It is easy to show that the following lemma holds.

Lemma 1. HK (t, N, h, j) and HK0 (t, N, h, j) are “complete”; that is, for any edit script s for t and for any mapping ν from every node in s(t) to a nonterminal in N , the following conditions hold.

– For any node ni in s(t) with ch(t, ni) = nh, · · · , nj, HK (t, N, h, j) contains a path from mh¡1 to mj representing ch(s(t), ni) under ν. – For any node nh,j in s(t), HK0 (t, N, h, j) contains a path from mh¡1 to mj representing ch(s(t), nh,j) under ν. 2 2

The algorithm computes σ(e) for every edge e in HK (t, N ) from lower to higher edges; along a partial order defined as follows. By mi¡1 →X mi we mean the X representative of K n-edges mi¡1 →X1 mi, · · · , mi¡1 →KX mi. Similarly, mh¡1 99K X X mj is the representative of K a-edges mh¡1 99K mj, · · · , mh¡1 9K9K mj. For edge 1 representatives e1, e2 in HK (t, N ), we write “e1 ≺ e2” if e1 is lower than e2. Formally, “≺” is a partial order over the edge representatives in HK (t, N ) such that for any X ∈ N , – e ≺ mi¡1 →X mi if ni is an internal node in t with ch(t, ni) = nh, · · · , nj and e is the edge representative of an edge in HK (t, N, h, j), and that

X – e ≺ mh¡1 99K mj if e is the edge representative of an edge in HK0 (t, N, h, j).

X X X For example, in Fig. 5(b) m10 99K m11 ≺ m5 →X m6 and m5 99K m6 ≺ m5 99K

Y X m7, but m5 99K m6 6≺ m5 99K m6. (NI , Enode ∪ Eadd ∪ E²d ∪ E²u ), where Checking the Validity of a Node Let t be a tree, ni be a node with ch(t, ni) = nh, · · · , nj, and ν be a mapping from every node in s(t) to a non-terminal in N . In order to find edit scripts s such that ν(ch(s(t), ni)) matches a regular expression r, we find paths p from mh¡1 to mj in HK (t, N, h, j) such that l(p) matches r (if p = e1 · · · en, then σ(e1) · · · σ(en) is an edit script we are looking for). Such a path can be found as follows. An NFA is a five-tuple (Q, N, qs, F, δ), where Q is a set of states, N is a set of non-terminals, qs ∈ Q is the start state, F ⊆ Q is a set of final states, and δ : Q × N → 2Q is a transition function. For a regular expression r over N , by M (r) we mean an ²-free NFA such that L(r) = L(M (r)), where L(M (r)) is the language represented by M (r). Let M (r) = (Q, N, qs, F, δ) and let NH and EH be the sets of nodes and edges in HK (t, N, h, j), respectively. We define the intersection of HK (t, N, h, j) and M (r), denoted HK (t, N, h, j) × M (r), analogously to the intersection of two regular languages. Formally, HK (t, N, h, j) × M (r) is defined as a graph I =

NI = {(mi, q) | mi ∈ NH , q ∈ Q}, Enode = {(mi¡1, q) →X (mi, q0) | mi¡1 →Xk mi ∈ EH , q0 ∈ δ(q, X)}, k Eadd = {(mh¡1, q) 9X9K (mj, q0) | mh¡1 99kK mj ∈ EH , q0 ∈ δ(q, X)},

X k E²d = {(mi¡1, q) →²d (mj¡1, q) | mi¡1 →²d mj¡1 ∈ EH , q ∈ Q},

E²u = {(mj, q) →²u (mi, q) | mj →²u mi ∈ EH , q ∈ Q}. in pI . If (mi0 , qi0 ) = (mh¡1, qs) and (min , qin ) = (mj, qf ) for some qf ∈ F , then pI is called accepting (recall that mh¡1 (mj) is the “leftmost” node (resp., “rightmost” node) in HK (t, N, h, j)). By definition, there is a path p in HK (t, N, h, j) such that l(p) ∈ L(M (r)) iff there is an accepting path pI embedding p in HK (t, N, h, j) × M (r).

Let pI = (mi0 , qi0 ) X→k11 (mi1 , qi1 ) X→k22 · · · X→knn (min , qin ) be a path in I and mi1 ) · · · σ(min¡1 X→knn min ). For convention, σ(p) = nil if σ(mij¡1 X→kjj mij ) = nil for some 1 ≤ j ≤ n. γ(σ(pI )) is called the weight of pI . For an edit script s for tni , if ni is the root of s(tni ), then s is called root preserving. For an edit script s for tnh,nj , if nh,j is the root of s(tnh,nj ), then s is called root adding. The algorithm computes

X – for every n-edge mi¡1 → k

mi, a kth optimum root preserving edit script

X σ(mi¡1 →k mi) between tni and (N, Σ, {X}, P ), and In line 14, P (X) denotes the set of productions in P whose left-hand sides are X. For each production X → a(r) selected in line 14, lines 15 to 17 find K optimum root preserving edit scripts between tni and (N, Σ, {X}, P ) w.r.t. X → a(r)3 and add them to T . By definition, a root preserving edit script between tni and (N, Σ, {X}, P ) w.r.t. X → a(r) consists of (i) ren(ni, a) and (ii) an edit script s0 between tnh,nj and (N, Σ, N, P ) such that the non-terminals on the roots of s0(tnh,nj ) match r under some interpretation. Line 16 can be solved by an algorithm for the K shortest paths problem, e.g., [ 6 ] (we assume that σ(pI k) is set to nil if there is no kth accepting shortest path pI k in I). Thus each σ(pI k) in line 17 represents a kth optimum edit script stated in (ii) above. Similarly, if

X X X eu = mh¡1 99K mj , then σ(mh¡1 99K mj ), · · · , σ(mh¡1 99K mj ) are obtained in 1 K lines 20 to 26.

We show the correctness of the algorithm.

Theorem 2. Let t be a tree, G = (N, Σ, S, P ) be a regular tree grammar, and K be a positive integer. Then the algorithm finds K optimum edit scripts between t and G.

Proof(sketch): Induction on the “height” of eu, from lower to higher edge representatives. Since any leaf node is labeled by pcdata, by lines 9 to 11 the basis case holds. As for the induction case, assume that for every edge e in HK (t, N ) “lower” than eu σ(e) is correctly obtained. Suppose that eu is set to

X X mi¡1 → mi for some internal node ni (the case where eu = mh¡1 99K mj can 3 Let s be a root preserving edit script between tni and (N, Σ, {X}, P ). For a production X → a(r), if λ(ni) = a in s(tni ) and ν(ch(s(tni ), ni)) ∈ L(r) under some interpretation ν, then we say that s is an edit script between tni and (N, Σ, {X}, P ) w.r.t. X → a(r). 27. return K optimum edit scripts in {σ(m0 →Xk m1) | X ∈ S, 1 ≤ k ≤ K}. be shown similarly). It suffices to show that ren(ni, a)σ(pIk ) in line 17 is a kth optimum edit script between tni and (N, Σ, {X}, P ) w.r.t. X → a(r). It holds that (i) by Lemma 1 HK (t, N, h, j) contains, for any edit script s for t and for any mapping ν, a path from mh¡1 to mj representing ch(s(t), ni) under ν and (ii) by the induction hypothesis for every edge e in HK (t, N, h, j) σ(e) is correctly obtained. This implies that σ(pIk ) found in line 17 is a kth optimum edit script between tnh,nj and (N, Σ, N, P ) such that the non-terminals on the roots of σ(pIk )(tnh,nj ) match r under some interpretation ν. Hence ren(ni, a)σ(pIk ) is a kth optimum edit script between tni and (N, Σ, {X}, P ) w.r.t. X → a(r). 2

Consider the running time of the algorithm. Assume that t is a k-ary tree for some constant k. Then the algorithm runs in O(K · |N | · |t|2 · |P | · R2 + K log K) time, where |t| is the number of nodes in t, |P | is the number of productions in

U,L,Pcdata

1,2 U,L,Pcdata 1,2 U,L,Pcdata U,L,Pcdata

1,2 1,2 m2 U,L,1P,c2data m3 U,L1,P,2cdata m4 n3ul εd U,L,Pcdata εu εd U,L,Pcdata εu

1,2 1,2 m5U,L1,P,2cdatam6 m7U,L1,P,2cdatam8

n3,u3l n3li n4li

n4li n6,l6i n pcdata n pcdata n pcdata n pcdata 6 8 6 8 P , and R = maxX!a(r)2P (|r|) (|r| denotes the length of regular expression r). In particular, if the content model of each production is one-unambiguous [ 2 ], the algorithm runs in O(K · |N | · |t|2 · |P | · R + K log K) time (any content model of DTD and W3C XML Schema must be one-unambiguous).

Example 3. Let t be the tree in Fig. 8(a), K = 2, and G = (N, Σ, S, P ) be a regular tree grammar, where N = {U, L, Pcdata}, Σ = {ul, li, pcdata}, S = {U }, P = {U → ul (U ?L), L → li (Pcdata), Pcdata → pcdata(²)}. We assume for any nodes ni, nh, nj that γ(del(ni)) = γ(add(a, nh, nj)) = 2 and that γ(ren(ni, a)) = 1 (except that γ(ren(ni, a)) = 0 if λ(ni) = a). The σ-values obtained by the algorithm are listed in Table 1 (the edges associated with nil are omitted). For example, consider finding σ(m0 →Uk m1) for k = 1, 2. As listed below, HK (t, N, 3, 4) contains three paths p from m2 to m3 such that l(p) = U and that σ(p) 6= nil, and contains three paths p from m3 to m4 such that l(p) = L and that σ(p) 6= nil.

m2 →U1 m3

U m2 991K m3

U m2 992K m3 path from m2 to m3 cost path from m3 to m4 cost 2 0 3 5 m3 →L1 m4

L m3 991K m4

L m3 →²d m7 991K m8 →²u m4 3 4 Thus HK (t, N, 3, 4) contains 3 × 3 = 9 paths p from m2 to m4 such that l(p) ∈ L(U ?L) and that σ(p) 6= nil, where m2 →U1 m3 →L1 m4

U L is the shortest and m2 99K m3 →

1 1 lines 16 to 19 we obtain σ(m0 →U m1) = ren(n1, ul )σ(m2 →U 1 1

U m4) = ren(n1, ul)ren(n3, ul)add(li, n6, n6)ren(n4, li) and σ(m0 → 2 m4 is the second shortest. Thus in In this paper, we first showed that finding K optimum edit scripts between an XML document and a regular tree grammar is NP-hard. We next presented a pseudopolynomial-time algorithm for solving the problem. As a future work, we would like to implement the algorithm and making experiments on the algorithm.

Acknowledgement

This work is partially supported by the Grant-in-Aid for Young Scientists (B) #18700019.

[1]

Bertino , G. Guerrini, and

Mesiti . A matching algorithm for measuring the structural similarity between an xml document and a dtd and its applications . Information Systems , 29 ( 1 ): 23 - 46 , 2004 .

[2]

Bruggenmann-Klein and

Wood . One-unambiguous regular languages . Information and Computation , 142 ( 2 ): 182 - 206 , 1998 .

[3]

E. R.

Canfield and

Xing . Approximate xml document matching . In Proc. ACM SAC , pages 787 - 788 , 2005 .

[4]

Chawathe and

Gracia-Molina . Meaningful change detection in structured data . In Proc. ACM SIGMOD Conf. , pages 26 - 37 , 1997 .

[5]

Cobena ,

Abiteboul , and

Marian . Detecting changes in xml documents . In Proc. ICDE , pages 41 - 52 , 2002 .

[6]

Eppstein . Finding the k shortest paths . SIAM J. Comput. , 28 ( 2 ): 652 - 673 , 1998 .

[7]

Gusfield . Algorithms on Strings, Trees, and Sequences . Cambridge University Press, 1997 .

[8]

Murata ,

Lee ,

Mani , and

Kawaguchi . Taxonomy of xml schema languages using formal language theory . ACM TOIT , 5 ( 4 ): 660 - 704 , 2005 .

[9]

Nierman and

H. V.

Jagadish . Evaluating structural similarity in xml documents . In Proc. WebDB , pages 61 - 66 , 2002 .

[10]

Papakonstantinou and

Vianu . Dtd inference for view of xml data . In Proc. PODS , pages 35 - 46 , 2000 .

[11]

Sankoff and

Kruskal . Time Warps, String Edits, and Macromolecules . Addison-Wesley, 1983 .

[12]

Suzuki . Finding an optimum edit script between an xml document and a dtd . In Proc. ACM SAC , pages 647 - 653 (Journal version to apper in IPSJ Transactions on Databases) , 2005 .

[13]

R. A.

Wagner and J. I. Seiferas. Correcting counter-automaton-recognizable languages . SIAM J. Comput. , 7 : 357 - 375 , 1978 .

[14]

Zhang and

Shasha . Simple fast algorithms for the editing distance between trees and related problems . SIAM J. Comput. , 18 ( 6 ): 1245 - 1262 , 1989 .