<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding K Optimum Edit Scripts between an XML Document and a Regular Tree Grammar</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Library, Information and Media Studies University of Tsukuba 1-2, Kasuga</institution>
          ,
          <addr-line>Tsukuba, Ibaraki 305-8550</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Finding optimum and near optimum edit scripts between an XML document and a schema is essential to correcting invalid XML documents. In this paper, we consider finding K optimum edit scripts between an XML document and a regular tree grammar. We first prove that the problem is NP-hard. We next show a pseudopolynomial-time algorithm for solving the problem.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>An edit script between an (invalid) XML document and a schema is a sequence
of edit operations to transform the document into a valid one against the schema.
Thus finding such an edit script, especially an optimum one, is essential to
correcting invalid XML documents. On the other hand, we often have to find
near optimum edit scripts as well as an optimum one, since an optimum edit
script is not necessarily the real “best” solution. In this paper, we consider an
algorithm for finding K optimum edit scripts between an XML document and a
regular tree grammar.</p>
      <p>
        There are three common tree grammars to describe an XML schema; local
tree grammar (corresponding to DTD), single-type tree grammar (corresponding
to W3C XML Schema), and regular tree grammar (corresponding to RELAX
NG) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It is shown that single-type tree grammar is more expressive than local
tree grammar, and regular tree grammar is more expressive than single-type tree
grammar. Therefore, our algorithm is applicable to a local tree grammar and a
single-type tree grammar as well. The expressive power of regular tree grammar
is equivalent to that of specialized DTD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>An XML document is modeled as a labeled ordered tree, where a text node is
labeled by pcdata (e.g., Fig. 1(a)). An edit script is a sequence of edit operations,
where each edit operation is either ren (rename the label of a node), add (add a
new node), or del (delete a node). Each edit operation is associated with a cost.
Let t be a tree, G be a regular tree grammar, s be an edit script, and s(t) be
the tree obtained by applying s to t. We say that s is an edit script between t
and G if s(t) is valid against G. s1, s2, · · · , sK are called K optimum edit scripts
between t and G if for every 1 ≤ k ≤ K sk is an edit script between t and G
such that sk has the kth least cost among the edit scripts between t and G.
name n3
first n6 last n7 name
n9
n4dept
address
n10
(b)</p>
      <p>name n3
first n6 last n7 name
n9
n4dept
address
n10
pcdatan12pcdatan14pcdatan16pcdatan18
pcdatan12pcdatan14pcdatan16pcdatan18
(c)
n6F
C
n12
n9N2
C
n16</p>
      <p>D
n4</p>
      <p>A
n10</p>
      <p>C
n18</p>
      <p>We often have to correct invalid XML documents if we manage a document
database, since (i) documents and the schema on a database are changed over
time and (ii) documents migrated into the database are not necessarily valid
against the schema. Unfortunately, such documents and a schema may be much
complicated, and the amount of documents you need to correct is often very
large. Therefore, finding manually a correct transformation for each invalid
document is surely impractical. On the other hand, our algorithm can present K
optimum edit scripts between an XML document and a regular tree grammar,
which are reasonable solutions to correcting the document since the real “best”
solution is likely to be an optimum or near optimum edit script between the
document and the grammar.</p>
      <p>
        Finding an edit script between data has originally been studied in terms
of strings [
        <xref ref-type="bibr" rid="ref11 ref7">7, 11</xref>
        ]. Then there have been studies on finding the minimum edit
distance between a string and a language (e.g., [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). However, these studies
consider only strings and deal with neither a tree nor a tree grammar. So far many
algorithms that find an optimum edit script between two ordered trees or XML
documents have been proposed (e.g., [
        <xref ref-type="bibr" rid="ref14 ref4 ref5 ref9">4, 5, 9, 14</xref>
        ]). However, these algorithms
consider neither schema nor finding K optimum edit scripts. The algorithm in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
measures the distance between an unordered XML document (unordered tree)
and a DTD, and the algorithm in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] finds the minimum edit distance between
an XML document and a regular hedge grammar. However, these algorithms
only measures a distance and do not present any actual edit script. The author
proposes an algorithm for finding an optimum edit script between an XML
document and a DTD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], but DTD is strictly less expressive than regular tree
grammar and the algorithm finds only one edit script rather than K optimum
edit scripts. Since an optimum edit script is not always the best solution, finding
K optimum edit scripts is a more practically important problem. For example,
let t be the tree in Fig. 2(a) and D be the DTD in Fig. 2(b). Assume that the
costs of ren, add, del are 1, 2, 3, respectively. Then the optimum edit script
between t and D is to rename n7 to i, the second optimum one is to add a new
node labeled by ul between n3 and n7, and the third one is to delete n3 or n7.
It is very likely that the second or the third one is the best solution.
      </p>
      <p>In this paper, we first prove that finding K optimum edit scripts between
an XML document and a regular tree grammar is NP-hard, then show a
pseudopolynomial-time algorithm for solving the problem.
An XML document is modeled as a labeled ordered tree. For simplicity,
attributes are not considered. In what follows, we use the term tree when we mean
labeled ordered tree. A leaf node represents a text node and an internal node
represents an element. Each node n is labeled by a symbol called terminal,
denoted λ(n). For example, λ(n1) = staff and λ(n3) = name in Fig. 1(a). We
assume that for any leaf node n, λ(n) = pcdata.</p>
      <p>For the simplicity of the algorithm, we denote each node of a tree t as follows
(Fig. 1(a) is an example).</p>
      <sec id="sec-1-1">
        <title>1. The root node of t is denoted n1. 2. Let ni be a node in t, and let nj be the node next to ni in breath-first order. If ni and nj are siblings, then j = i + 1, otherwise j = i + 2.</title>
        <p>No node in t is renumbered even if a new node is added to t or a node in t
is deleted. By tni we mean the subtree of t rooted at node ni. Similarly, for
consecutive siblings nh, · · · , nj in t, by tnh,nj we mean the subforest of t rooted
at nh, · · · , nj (Fig. 1(b) is an example). We say that the topmost nodes n3, n4
in Fig. 1(b) are the roots of tn3,n4 . By ch(t, ni), we mean the sequence of the
children of a node ni in a forest t (e.g., ch(t, n1) = n3, n4 in Fig. 1(a)).</p>
        <p>A regular tree grammar is a 4-tuple G = (N, Σ, S, P ), where N is a set of
non-terminals, Σ is a set of terminals, S ⊆ N is a set of start symbols, and P
is a set of productions of the form X → a(r) such that X ∈ N , a ∈ Σ, and
that r is a regular expression over N . For a production X → a(r), we say that
X is the left-hand side, a is the right-hand side, and r is the content model of
the production. Let t be a forest and ν be a mapping from every node n in
t to a non-terminal ν(n) ∈ N . For a sequence nh, · · · , nj of nodes, we define
that ν(nh, · · · , nj ) = ν(nh) · · · ν(nj ). We say that ν is an interpretation of t
against G if (i) ν(n) ∈ S for every root n of t (if t is a forest, then t may
have more than one roots), and (ii) for every node ni in t, there is a production
X → a(r) ∈ P such that ν(ni) = X, λ(ni) = a, and that ν(ch(t, ni)) ∈ L(r),
where L(r) is the language represented by r. A forest t is valid against G if there
is an interpretation of t against G.</p>
        <p>Example 1. Let G = (N, Σ, S, P ) be a regular tree grammar, where N =
{T, N1, N2, F, L, D, A, C}, Σ = {staff , name, first , last , dept , address, pcdata},
n3a
n8b n9b
pcdata pcdata
n17 n19
n4b n5b n 6b
pcdatapcdata pcdata
n11 n13 n15
add(a, n5 , n6 )</p>
        <p>del(n 3)</p>
        <p>S = {T }, P = {T → staff (N1D), N1 → name(F L), D → dept (N2A), F →
first (C), L → last (C), N2 → name(C), A → address(C), C → pcdata(²)}. Let t
be the tree in Fig. 1(a) and ν be the mapping illustrated in Fig. 1(c). Then ν is
the interpretation of t against G.</p>
        <p>Note that there is no DTD equivalent to G. G states that name has two
distinct types; the name element of a staff element must consist of first and last
elements and the name element of a dept element must consist only of a text
string. Such a constraint cannot be represented by any DTD. 2</p>
        <p>We use the following edit operations to transform a tree (Fig. 3 is an example).
– ren(ni, a): Changes the label of an internal node ni to a ∈ Σ\{pcdata}.
– del(ni): Deletes an internal node ni. We assume that no leaf node is deleted
(i.e., the texts in a document are preserved).
– add(a, nh, nj ): Adds a new node (denoted nh,j ) labeled by a ∈ Σ\{pcdata}
as the parent of siblings nh, · · · , nj .
– nil: An auxiliary edit operation stating that there is no valid edit operation
to transform a tree.</p>
        <p>Let t be a forest. An edit script for t (w.r.t. a set Σ of terminals) is a sequence
of edit operations defined by the following (1)–(4). By s(t) we mean the forest
obtained by applying each edit operation in an edit script s (from left to right)
to t. If an edit script s contains an edit operation op, then we write op ∈ s.
1. ² is an edit script for t of length zero, where ²(t) is identical to t.
2. If s0 is an edit script for t, ni is an internal node in t, del(ni) ∈/ s0, and
a ∈ Σ\{pcdata}, then s0ren(ni, a) is an edit script for t.
3. If s0 is an edit script for t, ni is an internal node in t, and del(ni) ∈/ s0, then
s0del(ni) is an edit script for t.
4. If s0 is an edit script for t, nh and nj are preserved siblings in s0(t), s0
contains no duplicate edit operation of add(a, nh, nj ) (defined below), and
a ∈ Σ\{pcdata}, then s0add(a, nh, nj ) is an edit script for t.</p>
        <p>Due to the construction of the algorithm, we have two restrictions in (4). We
say that nh and nj are preserved siblings if nh and nj are siblings in t with
h ≤ j and nh and nj remain siblings in s0(t). We say that add(a0, nh0 , nj0 ) is a
duplicate edit operation of add(a, nh, nj ) if nh0 = nh and nj0 = nj . (The latter
restriction can be eliminated without difficulty, but details are omitted because
of space limitation.)</p>
        <p>Let G = (N, Σ, S, P ) be a regular tree grammar, t be a forest, and s be an
edit script for t. We say that s is an edit script between t and G if s(t) is valid
against G. Each edit operation o is associated with a cost denoted γ(o), where
γ(o) ≥ 0. We assume that for any internal node ni and any a ∈ Σ\{pcdata},
γ(ren(ni, a)) = 0 if λ(ni) = a. We also assume that γ(nil) = ∞. The cost of
an edit script s is defined as γ(s) = ∑o2s γ(o). For a positive integer K, if
the following condition holds, then s1, s2, · · · , sK are K optimum edit scripts
between t and G.</p>
        <p>– For every 1 ≤ k ≤ K, sk is a kth optimum edit script between t and G;
that is, sk is an edit script between t and G such that sk(t) ∈/ Sk¡1 and
that γ(sk) ≤ γ(s) for any edit script s between t and G with s(t) ∈/ Sk¡1,
where Sk¡1 = {s1(t), s2(t), · · · , sk¡1(t)} (we assume that if there is no such
sk, then sk = nil).</p>
        <p>In what follows, for simplicity we assume that the root of an XML document
cannot be deleted and no new node can be added as the root.
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>NP-Hardness</title>
      <p>In this section, we show that finding K optimum edit scripts between a tree and
a regular tree grammar is NP-hard. For a tree t and edit scripts s, s0 for t, s is
distinct to s0 if s(t) 6= s0(t). The Kth optimum edit script problem is to decide,
for a tree t, a regular tree grammar G, and positive integers B and K, whether
there are K or more distinct edit scripts s between t and G such that γ(s) ≤ B.
We have the following theorem.</p>
      <p>Theorem 1. The Kth optimum edit script problem is NP-hard even if only ren
operation is allowed.</p>
      <p>Proof(sketch): The theorem can be shown by reducing the Kth largest subset
problem to the Kth optimum edit script problem. Details are omitted because
of space limitation. 2</p>
      <p>Thus, it is unlikely that we can find K optimum edit scripts between t and G
in time polynomial in the size of t, the size of G, and log K. However, we show
in the next section that the problem is solvable in pseudopolynomial time, i.e.,
in time polynomial in the size of t, the size of G, and K.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Overview of the Algorithm</title>
      <p>Before presenting the algorithm formally, we show an overview of the algorithm.
For simplicity, in this section we assume that K = 1. Let t be a tree and G be
a regular tree grammar. In short, our algorithm works as follows.
(a)
pcdata
pcdata
b n4
n8
(b)
ε</p>
      <p>X,Y</p>
      <p>X,Y
m2
ε X,Y</p>
      <p>X,Y
X,Y</p>
      <p>X,Y</p>
      <p>X,Y
m3
ε ε X,Y
ε
m4
ε
m5 X,Y m6 m7 X,Y m8
for two edges mi →X mj and mi →Y mj.
1. Any node in t may have various sequences of children according to edit
scripts applied to t. We first gather all such sequences into a single graph,
denoted H1(t, N ).
2. For each node ni in t, we find an “optimum” sequence of children of ni
by solving a shortest path problem over H1(t, N ). In fact, finding such a
sequence leads to an optimum edit script for tni , as explained later.
A node in a tree may have various sequences of children according to edit scripts
applied to the tree. For example, let t be the tree in Fig. 4(a).</p>
      <p>– Let s1 = ². Then ch(s1(t), n1) = n3, n4.
– Let s2 = add(a, n3, n3). Then ch(s2(t), n1) = n3,3, n4.</p>
      <p>– Let s3 = del(n3). Then ch(s3(t), n1) = n6, n4.</p>
      <p>Here, let us represent each node ni (nh,j) by an edge mi¡1 → mi (resp.,
mh¡1 99K mj). Then a sequence of children is represented by a path, e.g.,
n3,3, n4 is represented by path m2 99K m3 → m4. For every internal node ni
in t with ch(t, ni) = nh, · · · , nj, we also put a pair (mi¡1 →² mh¡1, mj →² mi)
of edges. This pair is used to visit the children of ni when ni is deleted (e.g.,
ch(s3(t), n1) = n6, n4 is represented by a path m2 →² m5 → m6 →² m3 → m4).
Let H(t) be a graph consisting of all such edges (Fig. 4(b)). H(t) is “complete”;
for any edit script s for t and for any node ni in t with ch(t, ni) = nh, · · · , nj,
H(t) contains a path from mh¡1 to mj that represents ch(s(t), ni).</p>
      <p>Let G = (N, Σ, S, P ) be a regular tree grammar. Since we have to check if
a transformed tree is valid against G, we also have to consider the non-terminal
associated with each node. Thus we incorporate non-terminals into H(t). This is
achieved as follows (assuming N = {X, Y }): replace each mi¡1 → mi (mh¡1 99K
X
mj) in H(t) by two edges mi¡1 →X mi and mi¡1 →Y mi (resp., mh¡1 99K mj and
mh¡1 9Y9K mj) (Fig. 4(c)). For example, path m2 9X9K m3 →Y m4 represents
a sequence n3,3, n4 of children such that X is associated with n3,3 and Y is
associated with n4. The obtained graph is denoted H1(t, N ).
Our algorithm is based on the following idea: we can find an optimum edit script
for tni provided that we have an optimum edit script for every proper subtree
of tni . In our algorithm, each such an edit script is associated with a
corresponding edge in H1(t, N ). For example, let H1(t, N ) be the graph in Fig. 4(c).</p>
      <p>X X
Assume that for each edge mi¡1 → mi in H1(t, N ) except m0 → m1, we have
a corresponding optimum edit script between tni and (N, Σ, {X}, P ), denoted
X X
σ(mi¡1 →X mi) (and that we also have σ(mh¡1 99K mj) for each mh¡1 99K mj).1</p>
      <p>X
Let us consider finding σ(m0 → m1), i.e., an optimum edit script between tn1
and (N, Σ, {X}, P ). For simplicity assume that X → a(r) is the only production
X
whose left hand side is X. Then since n1 must be labeled by a, σ(m0 → m1)
must consist of (i) ren(n1, a) and (ii) an optimum edit script s0 between tn3,n4
and (N, Σ, N, P ) such that the roots of s0(tn3,n4 ) match r. To obtain s0, we use
the following crucial property of H1(t, N ): for any path p from m2 to m4 in
H1(t, N ), σ(p) is an optimum edit script for tn3,n4 w.r.t. the nodes represented
by the path.2 For example, if p = m2 9X9K m3 →Y m4, then σ(p) is an optimum
edit script between tn3,n4 and (N, Σ, N, P ), assuming that n3,3, n4 are the
children of n1 and are associated with X and Y , respectively. This property and
the completeness of H1(t, N ) imply that, in order to find s0, it suffices to find a
path p from m2 to m4 such that</p>
      <sec id="sec-3-1">
        <title>1. the non-terminals on p matches r, and that 2. the cost of σ(p) is the smallest of the paths satisfying (1).</title>
        <p>Such a path can easily be found by solving a shortest path problem in the
“intersection graph” of H1(t, N ) and an NFA representing r.
5</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The Algorithm</title>
      <p>In this section, we first show some preliminaries, then show the algorithm.
5.1</p>
      <p>Preliminary Definitions
We first define HK (t, N ) and show some related definitions. We next show some
definitions to check if a path on HK (t, N ), representing a sequence of nodes,
satisfies a given regular expression.
1 We also assume that for any pair (mi¡1 →² mh¡1, mj →² mi), σ(mi¡1 →² mh¡1) =
del(ni) and σ(mj →² mi) = ².
2 If p = e1 · · · en, then σ(p) = σ(e1) · · · σ(en).</p>
      <p>(a)</p>
      <p>X,Y
1,2
m6 1X,,2Y
εu εd X,Y</p>
      <p>1,2
m10 1X,,2Y m11 m12 1X,,2Y m13</p>
      <p>X,Y
1,2</p>
      <p>X,Y
X,Y 1,2
1,2</p>
      <p>X,Y
1,2 X,Y
1,2
X,Y
1,2
m3 1,2</p>
      <p>X,Y
εu εd 1X,,2Y
m7 m8 1X,,2Y m9
εu
εu
m4
εu
stands for four edges mi →X1 mj, mi →X2 mj, mi →Y1 mj, mi →Y2 mj.</p>
      <p>Graph HK (t, N ) Let t be a tree, G = (N, Σ, S, P ) be a regular tree grammar,
and K be a positive integer. Let z = max{i | ni is a node in t}. HK (t, N ) is
defined as a graph consisting of z + 1 nodes m0, m1, · · · , mz and the following
edges (Fig. 5(a,b) is an example with N = {X, Y } and K = 2).</p>
      <p>– For every node ni in t, every X ∈ N , and every 1 ≤ k ≤ K, HK (t, N ) has
an edge mi¡1 →Xk mi, called node edge (n-edge). The non-terminal X means
that ni is associated with X under some mapping ν, i.e., ν(ni) = X.
– For every pair (nh, nj) of siblings in t (1 &lt; h ≤ j), every X ∈ N , and every
X
1 ≤ k ≤ K, HK (t, N ) has an edge mh¡1 99K mj. This is called add edge
k
(a-edge) and means adding a new node nh,j under some mapping ν such
that ν(nh,j) = X.
– For every internal node ni in t with ch(t, ni) = nh, · · · , nj, HK (t, N ) has a
²d mh¡1 and an upward ²-edge (²u-edge)
downward ²-edge (²d-edge) mi¡1 →</p>
      <p>²u mi. Pair (mi¡1 →²d mh¡1, mj →²u mi) is called ²-pair and represents
mj →
applying del(ni).
nh,j (any ²-edge is skipped).</p>
      <p>A path in HK (t, N ) represents a sequence of siblings. Let p = e1e2 · · · en be
a path in HK (t, N ). We define l(p) = l(e1)l(e2) · · · l(en), where l(ei) is the label
(non-terminal) of edge ei (²d and ²u are treated as empty strings). For example,</p>
      <p>Y
if p = m5 →²d m10 →X1 m11 →²u m6 99K m7, then l(p) = XY . Let q be a sequence of
2
siblings in a tree, ν be a mapping from every node on q to a non-terminal in N ,
and p be a path in HK (t, N ). Then p represents q under ν if (1) l(p) = ν(q) and
(2) q coincides with a sequence of nodes obtained from p by replacing (i) each</p>
      <p>X X
n-edge mi¡1 →k mi on p with ni and (ii) each a-edge mh¡1 99K mj on p with
k
Example 2. Let us consider Fig. 5(b). Consider ch(t, n3) = n6, n7 and suppose
X Y
that ν(n6) = X and that ν(n7) = Y . Then path m5 →k1 m6 →k2 m7 represents
(a)
m7
εu
ch(t, n3), for any k1, k2 ∈ {1, 2}. Changes made by add and del operations are
represented by corresponding add and ²-edges (a path is not changed by any ren
operation). For example, suppose that we apply s = add(a, n6, n6)del(n7) to t,
then ch(s(t), n3) = n6,6, n13 (Fig. 5(c)). Assuming that ν(n6,6) = X and that</p>
      <p>X Y
ν(n13) = Y , path m5 9k93K m6 →²d m12 →k4 m13 →²u m7 represents ch(s(t), n3) for
any k3, k4 ∈ {1, 2}.</p>
      <p>X
In order to compute edit script σ(mi¡1 →Xk mi) (σ(mh¡1 99K mj)), We have
k
to examine σ(e) for every edge e representing a descendant of ni (resp., nh,j).
Let nh, nj be siblings in t (h ≤ j). By HK (t, N, h, j) we mean a graph consisting
of the paths from mh¡1 to mj in HK (t, N ) (Fig. 6(a)). For an internal node
ni with ch(t, ni) = nh, · · · , nj, HK (t, N, h, j) consists of the edges
representing the descendants of ni. We also define HK0 (t, N, h, j), which is identical to
HK (t, N, h, j) except that the a-edges from mh¡1 to mj are missing (Fig. 6(b)).
HK0 (t, N, h, j) consists of the edges representing the descendants of nh,j. It is
easy to show that the following lemma holds.</p>
      <p>Lemma 1. HK (t, N, h, j) and HK0 (t, N, h, j) are “complete”; that is, for any
edit script s for t and for any mapping ν from every node in s(t) to a
nonterminal in N , the following conditions hold.</p>
      <p>– For any node ni in s(t) with ch(t, ni) = nh, · · · , nj, HK (t, N, h, j) contains
a path from mh¡1 to mj representing ch(s(t), ni) under ν.
– For any node nh,j in s(t), HK0 (t, N, h, j) contains a path from mh¡1 to mj
representing ch(s(t), nh,j) under ν.
2
2</p>
      <p>The algorithm computes σ(e) for every edge e in HK (t, N ) from lower to
higher edges; along a partial order defined as follows. By mi¡1 →X mi we mean the
X
representative of K n-edges mi¡1 →X1 mi, · · · , mi¡1 →KX mi. Similarly, mh¡1 99K
X X
mj is the representative of K a-edges mh¡1 99K mj, · · · , mh¡1 9K9K mj. For edge
1
representatives e1, e2 in HK (t, N ), we write “e1 ≺ e2” if e1 is lower than e2.
Formally, “≺” is a partial order over the edge representatives in HK (t, N ) such
that for any X ∈ N ,
– e ≺ mi¡1 →X mi if ni is an internal node in t with ch(t, ni) = nh, · · · , nj and
e is the edge representative of an edge in HK (t, N, h, j), and that</p>
      <p>X
– e ≺ mh¡1 99K mj if e is the edge representative of an edge in HK0 (t, N, h, j).</p>
      <p>X X X
For example, in Fig. 5(b) m10 99K m11 ≺ m5 →X m6 and m5 99K m6 ≺ m5 99K</p>
      <p>Y X
m7, but m5 99K m6 6≺ m5 99K m6.
(NI , Enode ∪ Eadd ∪ E²d ∪ E²u ), where
Checking the Validity of a Node Let t be a tree, ni be a node with ch(t, ni) =
nh, · · · , nj, and ν be a mapping from every node in s(t) to a non-terminal in
N . In order to find edit scripts s such that ν(ch(s(t), ni)) matches a regular
expression r, we find paths p from mh¡1 to mj in HK (t, N, h, j) such that l(p)
matches r (if p = e1 · · · en, then σ(e1) · · · σ(en) is an edit script we are looking
for). Such a path can be found as follows. An NFA is a five-tuple (Q, N, qs, F, δ),
where Q is a set of states, N is a set of non-terminals, qs ∈ Q is the start
state, F ⊆ Q is a set of final states, and δ : Q × N → 2Q is a transition
function. For a regular expression r over N , by M (r) we mean an ²-free NFA
such that L(r) = L(M (r)), where L(M (r)) is the language represented by M (r).
Let M (r) = (Q, N, qs, F, δ) and let NH and EH be the sets of nodes and edges
in HK (t, N, h, j), respectively. We define the intersection of HK (t, N, h, j) and
M (r), denoted HK (t, N, h, j) × M (r), analogously to the intersection of two
regular languages. Formally, HK (t, N, h, j) × M (r) is defined as a graph I =</p>
      <p>NI = {(mi, q) | mi ∈ NH , q ∈ Q},
Enode = {(mi¡1, q) →X (mi, q0) | mi¡1 →Xk mi ∈ EH , q0 ∈ δ(q, X)},
k
Eadd = {(mh¡1, q) 9X9K (mj, q0) | mh¡1 99kK mj ∈ EH , q0 ∈ δ(q, X)},</p>
      <p>X
k
E²d = {(mi¡1, q) →²d (mj¡1, q) | mi¡1 →²d mj¡1 ∈ EH , q ∈ Q},</p>
      <p>E²u = {(mj, q) →²u (mi, q) | mj →²u mi ∈ EH , q ∈ Q}.
in pI . If (mi0 , qi0 ) = (mh¡1, qs) and (min , qin ) = (mj, qf ) for some qf ∈ F , then
pI is called accepting (recall that mh¡1 (mj) is the “leftmost” node (resp.,
“rightmost” node) in HK (t, N, h, j)). By definition, there is a path p in HK (t, N, h, j)
such that l(p) ∈ L(M (r)) iff there is an accepting path pI embedding p in
HK (t, N, h, j) × M (r).</p>
      <p>Let pI = (mi0 , qi0 ) X→k11 (mi1 , qi1 ) X→k22 · · · X→knn (min , qin ) be a path in I and
mi1 ) · · · σ(min¡1 X→knn min ). For convention, σ(p) = nil if σ(mij¡1 X→kjj mij ) = nil
for some 1 ≤ j ≤ n. γ(σ(pI )) is called the weight of pI .
For an edit script s for tni , if ni is the root of s(tni ), then s is called root
preserving. For an edit script s for tnh,nj , if nh,j is the root of s(tnh,nj ), then s
is called root adding. The algorithm computes</p>
      <p>X
– for every n-edge mi¡1 →
k</p>
      <p>mi, a kth optimum root preserving edit script</p>
      <p>
        X
σ(mi¡1 →k mi) between tni and (N, Σ, {X}, P ), and
In line 14, P (X) denotes the set of productions in P whose left-hand sides are X.
For each production X → a(r) selected in line 14, lines 15 to 17 find K optimum
root preserving edit scripts between tni and (N, Σ, {X}, P ) w.r.t. X → a(r)3
and add them to T . By definition, a root preserving edit script between tni and
(N, Σ, {X}, P ) w.r.t. X → a(r) consists of (i) ren(ni, a) and (ii) an edit script
s0 between tnh,nj and (N, Σ, N, P ) such that the non-terminals on the roots
of s0(tnh,nj ) match r under some interpretation. Line 16 can be solved by an
algorithm for the K shortest paths problem, e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (we assume that σ(pI k) is
set to nil if there is no kth accepting shortest path pI k in I). Thus each σ(pI k)
in line 17 represents a kth optimum edit script stated in (ii) above. Similarly, if
      </p>
      <p>X X X
eu = mh¡1 99K mj , then σ(mh¡1 99K mj ), · · · , σ(mh¡1 99K mj ) are obtained in
1 K
lines 20 to 26.</p>
      <p>We show the correctness of the algorithm.</p>
      <p>Theorem 2. Let t be a tree, G = (N, Σ, S, P ) be a regular tree grammar, and K
be a positive integer. Then the algorithm finds K optimum edit scripts between
t and G.</p>
      <p>Proof(sketch): Induction on the “height” of eu, from lower to higher edge
representatives. Since any leaf node is labeled by pcdata, by lines 9 to 11 the
basis case holds. As for the induction case, assume that for every edge e in
HK (t, N ) “lower” than eu σ(e) is correctly obtained. Suppose that eu is set to</p>
      <p>X X
mi¡1 → mi for some internal node ni (the case where eu = mh¡1 99K mj can
3 Let s be a root preserving edit script between tni and (N, Σ, {X}, P ). For a
production X → a(r), if λ(ni) = a in s(tni ) and ν(ch(s(tni ), ni)) ∈ L(r) under some
interpretation ν, then we say that s is an edit script between tni and (N, Σ, {X}, P )
w.r.t. X → a(r).
27. return K optimum edit scripts in {σ(m0 →Xk m1) | X ∈ S, 1 ≤ k ≤ K}.
be shown similarly). It suffices to show that ren(ni, a)σ(pIk ) in line 17 is a kth
optimum edit script between tni and (N, Σ, {X}, P ) w.r.t. X → a(r). It holds
that (i) by Lemma 1 HK (t, N, h, j) contains, for any edit script s for t and for any
mapping ν, a path from mh¡1 to mj representing ch(s(t), ni) under ν and (ii)
by the induction hypothesis for every edge e in HK (t, N, h, j) σ(e) is correctly
obtained. This implies that σ(pIk ) found in line 17 is a kth optimum edit script
between tnh,nj and (N, Σ, N, P ) such that the non-terminals on the roots of
σ(pIk )(tnh,nj ) match r under some interpretation ν. Hence ren(ni, a)σ(pIk ) is a
kth optimum edit script between tni and (N, Σ, {X}, P ) w.r.t. X → a(r). 2</p>
      <p>Consider the running time of the algorithm. Assume that t is a k-ary tree for
some constant k. Then the algorithm runs in O(K · |N | · |t|2 · |P | · R2 + K log K)
time, where |t| is the number of nodes in t, |P | is the number of productions in</p>
      <p>U,L,Pcdata</p>
      <p>1,2
U,L,Pcdata
1,2
U,L,Pcdata U,L,Pcdata</p>
      <p>1,2 1,2
m2 U,L,1P,c2data m3 U,L1,P,2cdata m4 n3ul
εd U,L,Pcdata εu εd U,L,Pcdata εu</p>
      <p>1,2 1,2
m5U,L1,P,2cdatam6 m7U,L1,P,2cdatam8</p>
      <p>n3,u3l
n3li
n4li</p>
      <p>
        n4li
n6,l6i
n pcdata n pcdata n pcdata n pcdata
6 8 6 8
P , and R = maxX!a(r)2P (|r|) (|r| denotes the length of regular expression r).
In particular, if the content model of each production is one-unambiguous [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
the algorithm runs in O(K · |N | · |t|2 · |P | · R + K log K) time (any content model
of DTD and W3C XML Schema must be one-unambiguous).
      </p>
      <p>Example 3. Let t be the tree in Fig. 8(a), K = 2, and G = (N, Σ, S, P )
be a regular tree grammar, where N = {U, L, Pcdata}, Σ = {ul, li, pcdata},
S = {U }, P = {U → ul (U ?L), L → li (Pcdata), Pcdata → pcdata(²)}. We
assume for any nodes ni, nh, nj that γ(del(ni)) = γ(add(a, nh, nj)) = 2 and that
γ(ren(ni, a)) = 1 (except that γ(ren(ni, a)) = 0 if λ(ni) = a). The σ-values
obtained by the algorithm are listed in Table 1 (the edges associated with nil are
omitted). For example, consider finding σ(m0 →Uk m1) for k = 1, 2. As listed
below, HK (t, N, 3, 4) contains three paths p from m2 to m3 such that l(p) = U and
that σ(p) 6= nil, and contains three paths p from m3 to m4 such that l(p) = L
and that σ(p) 6= nil.</p>
      <p>m2 →U1 m3</p>
      <p>U
m2 991K m3</p>
      <p>U
m2 992K m3
path from m2 to m3 cost path from m3 to m4 cost
2 0
3
5
m3 →L1 m4</p>
      <p>L
m3 991K m4</p>
      <p>L
m3 →²d m7 991K m8 →²u m4 3
4
Thus HK (t, N, 3, 4) contains 3 × 3 = 9 paths p from m2 to m4 such
that l(p) ∈ L(U ?L) and that σ(p) 6= nil, where m2 →U1 m3 →L1 m4</p>
      <p>U L
is the shortest and m2 99K m3 →</p>
      <p>1 1
lines 16 to 19 we obtain σ(m0 →U m1) = ren(n1, ul )σ(m2 →U
1 1</p>
      <p>U
m4) = ren(n1, ul)ren(n3, ul)add(li, n6, n6)ren(n4, li) and σ(m0 →
2
m4 is the second shortest. Thus in
In this paper, we first showed that finding K optimum edit scripts between an
XML document and a regular tree grammar is NP-hard. We next presented a
pseudopolynomial-time algorithm for solving the problem. As a future work, we
would like to implement the algorithm and making experiments on the algorithm.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work is partially supported by the Grant-in-Aid for Young Scientists (B)
#18700019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bertino</surname>
          </string-name>
          , G. Guerrini, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mesiti</surname>
          </string-name>
          .
          <article-title>A matching algorithm for measuring the structural similarity between an xml document and a dtd and its applications</article-title>
          .
          <source>Information Systems</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bruggenmann-Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wood</surname>
          </string-name>
          .
          <article-title>One-unambiguous regular languages</article-title>
          .
          <source>Information and Computation</source>
          ,
          <volume>142</volume>
          (
          <issue>2</issue>
          ):
          <fpage>182</fpage>
          -
          <lpage>206</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Canfield</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Xing</surname>
          </string-name>
          .
          <article-title>Approximate xml document matching</article-title>
          .
          <source>In Proc. ACM SAC</source>
          , pages
          <fpage>787</fpage>
          -
          <lpage>788</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawathe</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Gracia-Molina</surname>
          </string-name>
          .
          <article-title>Meaningful change detection in structured data</article-title>
          .
          <source>In Proc. ACM SIGMOD Conf.</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cobena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Marian</surname>
          </string-name>
          .
          <article-title>Detecting changes in xml documents</article-title>
          .
          <source>In Proc. ICDE</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>52</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Eppstein</surname>
          </string-name>
          .
          <article-title>Finding the k shortest paths</article-title>
          .
          <source>SIAM J. Comput.</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ):
          <fpage>652</fpage>
          -
          <lpage>673</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gusfield</surname>
          </string-name>
          . Algorithms on Strings,
          <source>Trees, and Sequences</source>
          . Cambridge University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Murata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawaguchi</surname>
          </string-name>
          .
          <article-title>Taxonomy of xml schema languages using formal language theory</article-title>
          .
          <source>ACM TOIT</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>660</fpage>
          -
          <lpage>704</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nierman</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Evaluating structural similarity in xml documents</article-title>
          .
          <source>In Proc. WebDB</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vianu</surname>
          </string-name>
          .
          <article-title>Dtd inference for view of xml data</article-title>
          .
          <source>In Proc. PODS</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sankoff</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kruskal</surname>
          </string-name>
          . Time Warps, String Edits, and
          <string-name>
            <surname>Macromolecules</surname>
          </string-name>
          . Addison-Wesley,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Suzuki</surname>
          </string-name>
          .
          <article-title>Finding an optimum edit script between an xml document and a dtd</article-title>
          .
          <source>In Proc. ACM SAC</source>
          , pages
          <fpage>647</fpage>
          -
          <lpage>653</lpage>
          (Journal version to apper
          <source>in IPSJ Transactions on Databases)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Wagner</surname>
          </string-name>
          and
          <string-name>
            <surname>J. I. Seiferas.</surname>
          </string-name>
          <article-title>Correcting counter-automaton-recognizable languages</article-title>
          .
          <source>SIAM J. Comput.</source>
          ,
          <volume>7</volume>
          :
          <fpage>357</fpage>
          -
          <lpage>375</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Shasha</surname>
          </string-name>
          .
          <article-title>Simple fast algorithms for the editing distance between trees and related problems</article-title>
          .
          <source>SIAM J. Comput.</source>
          ,
          <volume>18</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1245</fpage>
          -
          <lpage>1262</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>