<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An XML-to-Relational User-Driven Mapping Strategy Based on Similarity and Adaptivity ¤</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>°c Irena Mlynkova</string-name>
          <email>irena.mlynkova@mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University Faculty of Mathematics and Physics Department of Software Engineering Malostranske nam.</institution>
          <addr-line>25 118 00 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems</institution>
          ,
          <addr-line>Moscow, Russia, 2007</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>As XML has become a standard for data representation, it is inevitable to propose and implement techniques for efficient managing of XML data. A natural alternative is to exploit features and functions of (object-)relational database systems, i.e. to rely on their long theoretical and practical history. The main concern of such techniques is the choice of an appropriate XML-to-relational mapping strategy. In this paper we focus on enhancing of userdriven techniques which leave the mapping decisions in hands of users. We propose an algorithm which exploits the user-given annotations more deeply searching the user-specified “hints” in the rest of the schema and applies an adaptive method on the remaining schema fragments. We describe the algorithm theoretically, discussing the key ideas of the approach, chosen solutions, their reasons, and consequences. Finally, we overview the open issues related to implementation of the proposed algorithm and its experimental testing on real XML data. ¤ This work was supported in part by the National Programme of Research (Information Society Project number 1ET100300419).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Without any doubt the eXtensible Markup Language
(XML) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is currently de-facto a standard for data
representation and manipulation. Its popularity is given by
the fact that the basic W3C recommendations are
welldefined, easy-to-learn, and at the same time still enough
powerful. The popularity naturally invoked a boom of
their efficient implementations based on various storage
strategies from the traditional ones such as file systems
to brand-new ones proposed particularly for XML
structures, so-called native approaches or directly native XML
databases.
      </p>
      <p>Probably the most natural and practically used
approach involves techniques which exploit features of
(object-)relational database systems, though they are not
as efficient as the native ones due to the key problem of
structural differences between XML data and relations.
The reason for the popularity is that relational databases
are still regarded as universal and powerful data
processing tools and their long theoretical and practical history
can guarantee a reasonable level of reliability and
efficiency. Contrary to native methods it is not necessary
to start “from scratch” but we can rely on a mature and
verified technology, i.e. properties that no native XML
database can offer yet. On this account we believe that
these methods and their possible improvements should
be further enhanced.</p>
      <p>
        The main concern of the database-based1 XML
techniques is the choice of an appropriate XML-to-relational
mapping strategy, i.e. the way the given XML data are
stored into relations. We can distinguish the following
three types approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
² fixed mapping methods based on predefined set of
mapping rules and appropriate heuristics (e.g. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]),
² adaptive methods which adapt the target database
schema to the intended application (e.g. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]),
and
² methods which leave the mapping decisions in
hands of users (e.g. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
      </p>
      <p>
        The first set of methods can be further divided [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
into generic and schema-driven ones depending on
omitting or exploiting the existence of corresponding XML
schema. However, both the types use a straightforward
mapping strategy regardless the intended future usage.
On the contrary, adaptive methods automatically adapt
the database schema of a fixed method to the given
additional information. The best known representatives,
so-called cost-driven methods, usually search a space
of possible XML-to-relational mappings and choose the
one which conforms to the required application,
specified using a sample set of XML data and XML queries,
the most, i.e. where the provided queries over the given
data can be evaluated most efficiently. Finally, the last
1In the rest of the paper the term “database” represents an (object-)
relational database.
mentioned type of methods can be also divided into two
groups. We distinguish so-called user-defined and
userdriven methods [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] which differentiate in the amount of
necessary user interaction. In the former case a user is
expected to define both the target database schema and
the required mapping strategy, i.e. to do all the work
“manually”. In the latter case a default mapping strategy
is defined but a user can specify local mapping changes
from the predefined set of other allowed mapping
strategies, usually using schema annotations. In other words,
the user-driven approach solves the main disadvantage of
the user-defined one – the requirement of a user skilled
in two complex technologies who is, in addition, able to
specify an optimal database schema for a particular
application. Note that the user-driven techniques can be
regarded as a type of adaptive methods too [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], since
they also adapt the default target schema to additional
information, in particular to user-specified requirements.
      </p>
      <p>In this paper we focus on further enhancing of
userdriven techniques, particularly on their (in our opinion)
two main persisting disadvantages. The first one is the
fact that the default mapping strategy is (to our
knowledge) always a fixed one. It is quite a surprising finding
since we know that the proposed systems are able to store
schema fragments in various ways. From this point of
view an adaptive enhancing of the fixed method seems
to be quite natural and suitable. The second key
shortcoming we are dealing with is weak exploitation of the
user-given information. We believe that the schema
annotations a user provides can not only be directly applied
on particular schema fragments, but the information they
carry can be further exploited. The main idea is quite
simple – we regard the annotations as “hints” how a user
wants to store particular XML patterns and we use this
information twice again. Firstly, we search for similar
patterns in the rest of the schema and store the found
fragments in a similar way. And secondly, we exploit
the information in the adaptive strategy for not annotated
parts of the schema.</p>
      <p>To sum up, the main contribution of this paper is a
proposal of an algorithm which enhances classical
userdriven strategies using the following two approaches:
² a deeper exploitation of the information carried in
user-given schema annotations and
² an adaptive mapping strategy for not annotated parts
of the schema.</p>
      <p>We describe the proposed algorithm theoretically. We
discuss the key ideas and problems, their possible
solutions, reasons for our particular decisions, and their
consequences. Finally, we overview the related problems
and open issues of a particular implementation of the
proposal.</p>
      <p>The rest of the paper is structured as follows: Section
2 contains a motivation for focusing on user-given
information. Section 3 overviews the existing related works in
the area of user-driven methods, adaptive methods, and
similarity of XML data. In the fourth section we describe
and discuss the proposed algorithm in more detail and in
the fifth section we sum up the corresponding
implementation open issues. Finally, Section 6 provides
conclusions and outlines our future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
      <p>The key concern of this paper is to exploit the user-given
information as much as possible. We result from the idea
of user-driven enhancing of the user-defined techniques,
where a user is expected to help the mapping process, not
to perform it. We want to go even farther. But first of all
we discuss why user-given information is so important to
deal with.</p>
      <p>
        A simple demonstrative example can be a set of XML
documents which contain various XHTML [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
fragments. A classical fixed schema-driven mapping
strategy (e.g. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) would decompose the fragments into
a number of relations. Since we know that the standard
XHTML DTD allows, e.g., complete subgraphs on up to
10 nodes, the reconstruction of such fragments would be
a really expensive operation in terms of the number of
join operations. But if we knew that the real complexity
of such fragments is much simpler (and the analysis of
real XML data shows that it is quite probable [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]), e.g.
that each of the fragments can be described as a simple
text with tags having the depth of 2 at most, we could
choose a much simpler storage strategy including the
extreme one – a CLOB column.
      </p>
      <p>Another example can be the crucial feature of
database storage strategies – updatability of the data. On
one hand, we could know that the data will not be
updated too much or at all but we need an effective query
evaluation. On the other hand, there could be a strong
demand for effective data updates, whereas the queries
are of marginal importance. And there are of course
cases which require effective processing of both.
Naturally, the appropriate storage strategies differ strongly.
In case of effective query processing a number of indices
and numbering schemes can be exploited but at the cost
of corresponding expensive updates. Effective updates,
conversely, require the simplest information of mutual
data relationships. And if both the aspects are required,
it is unavoidable to compromise. And such decision can
be again made correctly only if we have an appropriate
information on the required future usage.</p>
      <p>
        Last but not least, let us consider the question of data
redundancy. Without any additional information the
optimal storage strategy is so-called 4NF schema
decomposition into relations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where 4NF stands for the fourth
normal form, which can be achieved, e.g., using the
classical Hybrid algorithm [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], a representative of fixed
mapping methods. The decomposition does not involve
data redundancy or violation of any normal form, i.e. it
results in a database schema with the lowest number of
relations and null attributes. But, similarly to database
design, there can be reasonable real-world cases when
the data should not strictly follow the rules of normal
forms and their moderation can lead to more effective
query processing.
      </p>
      <p>
        Both the cost-driven and user-driven methods are
based on the idea of exploiting additional user-given
information and they appropriately adapt the target
database schema. In the former case it is extracted from
a sample set of XML documents and/or XML queries
which characterize the typical future usage, in the
latter case it is specified by user-given annotations, i.e. the
user directly specifies the required changes of the default
mapping. But although there is a plenty of existing
representatives of the two approaches, there are still numerous
weak points and open issues that should be improved and
solved [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>As mentioned above, our first improvement is
searching for identical or similar fragments in the not annotated
schema parts. This approach has two main advantages:
1. The user is not forced to annotate all schema
fragments that have to be stored alternatively, but only
those with different structure. Thus the system is
not endangered of unintended omitting of
annotating all similar cases.
2. The system can reveal structural similarities which
are not evident “at first glance” and which could
remain hidden to the user.</p>
      <p>Thus the first main concern of our proposal is how to
identify identical or similar fragments within the schema.</p>
      <p>Our second enhancing focuses on the choice of the
mapping strategy for schema fragments which were
neither annotated by the user, nor identified as fragments
similar to the annotated ones. In this case we combine
the idea of cost-driven methods with the fact that a
userdriven technique should support various storage
strategies too. Hence our second concern is how to find the
optimal mapping strategy for the remaining schema
fragments and, in addition, with exploitation of the
information we already have, i.e. the user-specified annotations,
as much as possible.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        As we have mentioned, methods which involve a user in
the mapping process can be divided into user-defined and
user-driven. Probably due to simple implementation the
former ones are supported in most commercial database
systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. On the other hand, the set of techniques of
the latter type is surprisingly small. To our knowledge
there are just two main representatives of the approach
– so-called Mapping Definition Framework (MDF) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
and XCacheDB System [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Both support inlining and
outlining of an element / attribute, mapping an element
/ attribute to a CLOB column, renaming target tables /
columns, and redefining column data types. The former
approach furthermore supports the Edge mapping [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
strategy and enables to specify the required capturing of
the structure of the whole schema. The latter one, in
addition, allows a certain degree of redundancy.
      </p>
      <p>
        In both the cases the mapping for not annotated parts
is fixed and the annotations are applied just directly on
the annotated schema fragments. The two ideas we want
to use for their enhancing are adaptivity [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and
similarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
3.1
      </p>
      <sec id="sec-3-1">
        <title>Adaptive XML-to-Relational Mapping</title>
        <p>
          Probably the first proposal of an adaptive cost-driven
method can be found in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. It is based on the idea
of storing well structured parts of XML documents
into relations (using the 4NF decomposition) and
semistructured parts using an XML data type, which supports
path queries and XML-aware full-text operations. The
main concern of the method is to identify the structured
and semi-structured parts. For this purpose a sample set
of XML documents and XML queries is used.
        </p>
        <p>
          The other existing cost-driven approaches [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] use a different strategy. They define a set of
XMLto-XML transformations (e.g. inlining / outlining of an
element / attribute, splitting / merging of a shared
element2, associativity, commutativity, etc.), a fixed
XMLto-relational mapping, and a cost function which
evaluates a relational schema against a given sample set of
XML data and/or queries. Using a search algorithm a
space of possible relational schemes is searched and the
optimal one is selected. Since it can be proven that even
a simple set of transformations causes the problem to
be NP-hard, the corresponding search algorithms in fact
search for suboptimal solutions and exploit, e.g.,
heuristics, terminal conditions, approximations, etc.
3.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Similarity of XML Data</title>
        <p>Exploitation of similarity of XML data can be found in
various XML technologies, such as, e.g., document
validation, query processing, data transformation, storage
strategies based on clustering, data integration systems,
dissemination-based applications, etc. Consequently, the
number of existing works is enormous. We can search
for similarity among XML documents, XML schemes,
or between the two groups. Furthermore, we can
distinguish several levels of similarity that can be taken into
account during the search process – a structural level (i.e.
considering only the structure of the given XML
fragments), a semantic level (i.e. taking into account also
the meaning of element / attribute names), a constraint
level (i.e. taking into account also various text value
constraints), etc.</p>
        <p>
          In case of document similarity we distinguish
techniques expressing the similarity of two documents D1
and D2 by measuring how difficult is to transform D1
into D2 (e.g. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]) and techniques which specify a simple
and reasonable representation of D1 and D2 that enables
their efficient comparison and similarity evaluation (e.g.
[
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]). In case of similarity of document D and schema S
there are also two types of strategies – techniques which
measure the number of elements which appear in D but
not in S and vice versa (e.g. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) and techniques which
measure the closest distance between D and all
documents valid against S (e.g. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]). Finally, methods for
measuring similarity of two XML schemes S1 and S2
exploit and combine various supplemental information and
measures such as, e.g., predefined similarity rules,
similarity of element / attribute names, equivalence of data
types and structure, schema instances, thesauri, previous
results, etc. (e.g. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ])
4
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Proposed Algorithm</title>
      <p>
        The general idea of fixed schema-driven
XML-torelational mapping methods is to decompose the given
XML schema S into a set of schema fragments
F = ff1; f2; :::; fng. Each fi 2 F is then mapped to
a corresponding relation ri using a mapping strategy
sfrag. The relationship between fragments is kept
using a mapping strategy srel. In other words, the 3-tuple
2An element with multiple parent elements in the schema – see [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
¿ = &lt; F; sf rag ; srel &gt; specifies a particular
XML-torelational mapping method. An extreme case is when
S is considered as a single fragment which results in a
single relation, usually with many null values. Other
extreme occurs when each element in S is considered as a
single fragment resulting in a huge number of relations
and thus numerous join operations.
      </p>
      <p>Note that fixed generic methods can be described
using the 3-tuple ¿ as well. Each of the techniques views
an XML document as general directed tree with several
types of nodes. And the view can be considered as a
special kind of XML schema.</p>
      <p>In user-driven strategies the three characteristics are
influenced by user-defined annotations which specify
how a particular user wants to store selected fragments
F 0 = ff10 ; f20 ; :::; f m0g. The user usually provides S
with annotating attributes from the predefined set of
attribute names §A0 , which represent various fixed
mapping strategies, resulting in an annotated schema S0.
Obviously, annotated fragments in F 0 do not have to
correspond to schema fragments in F . For instance, a
user may specify that whole S should be stored using a
4NF decomposition and thus F 0 = fS0g is a singleton,
whereas corresponding F can contain several schema
fragments depending on the complexity of S. Also note
that while F should cover whole S, F 0 usually does not.</p>
      <p>A classical user-driven strategy consists of the
following steps:
1. S is provided with annotations from §A0 resulting
in S0 and F 0.
2. Annotated fragments from F 0 are decomposed
into corresponding schema fragments F1 =
ff1; f2; :::; fkg.
3. Not annotated fragments of S0 are decomposed into
schema fragments F2 = ffk+1; fk+2; :::; fng using
a default (predefined or user-defined) fixed strategy
sdef .
4. F = F1 [ F2 is mapped to R using sf rag and srel.</p>
      <p>Using this notation our proposed enhancing simply
adds the following steps between the step 1 and 2:
a. For 8 f 0 2 F 0 we identify a set Ff0 0 of all fragments
similar to f 0 occurring in S0nff 0g.
b. For 8 f 0 2 F 0 all fragments in Ff0 0 are annotated with
annotating attributes of f 0 and F 0 = F 0 [ Ff0 0 .
c. S0nF 0 is annotated using an adaptive strategy which
adds new annotated fragments to F 0.</p>
      <p>The mapping process is schematically depicted in
Figure 1 where the given schema S with two annotated
fragments f and g is mapped to a database schema R. If the
proposed enhancing (i.e. steps 1.a – 1.c) is included, the
system gradually identifies and adds new annotated
fragments f1, f2, g1, g2, and g3 which are mapped using
user-required mapping strategies. If the enhancing is not
included (i.e. in case of a classical user-driven strategy),
only fragments f and g are annotated using user-required
strategies and the rest of the schema using sdef .
4.1</p>
      <sec id="sec-4-1">
        <title>Searching for Similar Fragments</title>
        <p>As we have mentioned, there are numerous approaches to
measuring similarity of XML data. Nevertheless, most
of them cannot be directly used for our case since our
demanded key characteristics differ. In particular, we
search for similarity within the scope of a single schema,
the similarity measure should not depend on similarity
of element / attribute names but primarily on
complexity of content models, and the similarity measure cannot
obviously depend on the context of fragments.</p>
        <p>Considering the problem in more depth, several
fundamental questions arise:
1. How are the annotated fragments defined?</p>
        <sec id="sec-4-1-1">
          <title>2. What types of annotations, i.e.</title>
          <p>strategies, are supported?
fixed mapping
3. What measure is used for measuring similarity of
two schema fragments?
4. Can we optimize the exhaustive search strategy?
The answers for the questions mutually influence each
other and specify the algorithm. Furthermore, the
definition of annotated fragments together with the question of
their mutual intersection are closely related to supported
mapping strategies.
4.1.1</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Annotated Fragments</title>
        <p>
          First of all, for easier processing we define a graph
representation of an XML schema S, no matter if annotated
or not. For easier explanation we assume that the given
XML schema S is expressed in DTD3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], nevertheless
the algorithm can be applied to schemes expressed using,
e.g., XML Schema [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] language as well.
Definition 1 A schema graph of an XML schema S is
a directed, labelled graph GS = (V; E; §E ; §A; lab),
where
² V is a finite set of nodes,
² E µ V £ V is a set of edges,
² §E is a set of element names in S,
² §A is a set of attribute names in S, and
² lab : V ! §E [ §A [ f“j”, “*”, “+”, “?”, “,”g [
fpcdatag is a surjective function which assigns a
label to 8v 2 V .
        </p>
        <p>Definition 2 A fragment of a schema S is each subgraph
of S consisting of an element e, all its descendants, and
corresponding edges.</p>
        <p>© is a set of all fragments of S.</p>
        <p>Next, we assume that each annotated fragment f 0 2
F 0 is uniquely determined by the element e which was
annotated using an annotating attribute a 2 §A0 .
Definition 3 An annotated element e0 of schema S0 is an
element provided with an annotated attribute from §A0 .</p>
        <p>3We omit supplemental constructs such as entities, CDATA
sections, comments, etc.
Definition 4 An annotated fragment f 0 of schema S0 is a
fragment of S0 rooted at an annotated element e0
excluding all annotating attributes from §A0 .</p>
        <p>
          As we want to support shared elements and recursion,
since both the constructs are widely used in real XML
data [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we must naturally allow the annotated
fragments to intersect almost arbitrarily. To simplify the
situation, we define an expanded schema graph, which
exploits the idea that both the constructs purely indicate
repeated occurrence of a particular pattern.
        </p>
        <p>Definition 5 An expanded schema graph GeSx is a result
of the following transformations of schema graph GS :
1. Each shared element is duplicated for each sharer
using a deep copy operation, i.e. including all its
descendants and corresponding edges.
2. Each recursive element is duplicated for each
repeated occurrence using a shallow copy operation,
i.e. only the element node itself is duplicated.</p>
        <p>An illustrative example of a schema graph GS and its
expanded schema graph GeSx is depicted in Figure 2. A
shared element is highlighted using a dotted rectangle, a
recursive element is highlighted using a dotted circle.</p>
        <p>
          As it is obvious, in case of shared elements the
expansion is lossless operation. It simply omits the key
advantage of shared elements which allows reusing of
previously defined schema fragments. The situation is more
complicated in case of recursive elements which need to
be treated in a special way henceforth. For this purpose
we exploit results of statistical analysis of real-world
recursive elements [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], particularly the fact that the most
common type of recursion is linear4 and that the number
4Consisting of a single recursive element that does not branch out.
of repetitions of a recursive element is small, on average
less than 5. The two findings enable to treat recursive
elements not as elements with theoretically infinite depth,
but in a much simpler way. We discuss the details later
in the text.
        </p>
        <p>In the following text we assume that a schema graph
of an XML schema is always an expanded schema graph,
if not explicitly stated alternatively.
4.1.2</p>
      </sec>
      <sec id="sec-4-3">
        <title>Types of Annotations</title>
        <p>From Definitions 4 and 5 we can easily prove the
following two statements:
Lemma 1 Each expanded schema graph GeSx is a tree.
Lemma 2 Two annotated fragments f x0 and fy0 of an
expanded schema graph GeSx0 can intersect only if f x0 µ fy0
or f x0 µ fy0 .</p>
        <p>Furthermore, we can observe that the common schema
fragment, i.e. the intersection, contains all descendants
of a particular element.</p>
        <p>We distinguish three types of the annotation
intersection depending on the way the corresponding mapping
strategies influence each other.</p>
        <p>Definition 6 Intersecting annotations are redundant if
the corresponding mapping strategies are applied on the
common schema fragment separately.</p>
        <p>Definition 7 Intersecting annotations are overriding if
only one of the corresponding mapping strategies is
applied on the common schema fragment.</p>
        <p>Definition 8 Intersecting annotations are influencing if
the corresponding mapping strategies are combined
resulting in one composite storage strategy applied on the
common schema fragment.</p>
        <p>Redundant annotations can be exploited, e.g., when
a user wants to store XHTML fragments both in a
single CLOB column (for fast retrieval of the whole
fragment) and, at the same time, into a set of tables (to enable
querying particular items). An example of overriding
annotations can occur when a user specifies a general
mapping strategy for the whole schema S and then annotates
fragments which should be stored differently. Naturally,
in this case the strategy which is applied on the common
schema fragment is always the one specified for its root
element. The last mentioned type of annotations can be
used in a situation when a user specifies, e.g., the 4NF
decomposition for a particular schema fragment and at
the same time an additional numbering schema which
speeds up processing of particular types of queries. In
this case the numbering schema is regarded as a
supplemental index over the data stored in relations of 4NF
decomposition, i.e. the data are not stored redundantly as
in the first case.</p>
        <p>
          We assume that each pair of annotations implemented
in a corresponding system is assigned its intersection
type or, if necessary, a particular combination of
annotations can be denoted as forbidden. The existing
systems [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] mostly support overriding annotations, the
XCacheDB system [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], in addition, supports a type of
redundant intersection similar to the above described
example. Furthermore, we assume that the implemented
annotations are assigned priorities which specify the
order in which they are composed when applied on
common schema fragment.
4.1.3
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Search Algorithm</title>
        <p>The similarity measure, the search algorithm, and its
possible optimization are closely related. However, the main
idea of the enhancing of user-driven techniques remains
the same regardless the chosen measure and algorithm.
The choice of the measure influences the precision and
scalability of the system, whereas the algorithm
influences the efficiency of finding the required fragments.
In case of “classical” adaptive methods this can be a
marginal aim, since the schema adaptation is performed
only once, before the schema is created. But its
importance significantly rises when the system needs to be
dynamic and adapt continuously.</p>
        <p>Let us suppose that we have a similarity measure
sim(fx; fy ) 2 [0; 1] expressing similarity of two
fragments fx and fy of an expanded graph, where 1
represents strong similarity and 0 strong dissimilarity, and
a similarity threshold Tsim 2 [0; 1]. A naive strategy
would exploit an exhaustive search as depicted by
Algorithm 1.</p>
        <p>The algorithm evaluates similarity of each annotated
fragment f 0 2 F 0 and each fragment fe rooted at an
element e 2 S0nff 0g. We can assume that if the storage
strategy for any fe should not change, the user would
mark it as final. For such fragment either the default
strategy sdef or the strategy specified by corresponding
user-defined annotation will be used, regardless the
results of the search or adaptive algorithm. As such this
situation is rather the problem of implementation than a
theoretical one and thus we further assume that there are
no such fragments in S0. On the other hand, we
naturally regard fragments annotated by a user to be final by
default.</p>
        <p>The resulting similarity values are stored into
socalled similarity matrix fsim[f 0; fe]gf 02F 0; e2S0 . An
element e is annotated if there exists a fragment fmax 2 F 0
with the highest similarity value sim(fmax; fe) &gt; Tsim.
In Algorithm 1 we assume that there is always one such
candidate at most. Otherwise, the system can ask for
user intervention when necessary. We call this approach
a single annotation strategy (SAS).</p>
        <sec id="sec-4-4-1">
          <title>Algorithm 1 Single Annotation Strategy (SAS)</title>
          <p>Input: S0, F 0, sim(fx; fy ), Tsim
Output: F 00, i.e. F 0 [ newly annotated fragments
fconstruction of the similarity matrixg
1: F 00 Ã F 0
2: for all f 0 2 F 0 do
3: for all element e 2 S0 do
4: fe Ã subgraph rooted at e
5: if e 2 S0nff 0g then
6: sim[f 0; fe] Ã sim(f 0; fe)
7: else
8: sim[f 0; fe] Ã 0
9: end if
10: end for
11: end for</p>
          <p>fannotation strategyg
12: for all element e 2 S0 do
13: maxe Ã maxf 02F 0 fsim[f 0; fe]g
14: fmax Ã f 0 2 F 0 s.t. sim[f 0; fe] = maxe
15: if maxe &gt; Tsim then
16: fe.annotation Ã fmax.annotation
17: F 00 Ã F 00 [ ffeg
18: end if
19: end for
20: return F 00</p>
          <p>The question is whether this approach is correct
actually. From another point of view it could be more
reasonable to annotate an element e using annotations of
all fragments f 0 2 F 0 s.t. sim(f 0; fe) &gt; Tsim.
Together with the assumption that for each pair of
annotations the result of their composition is predefined and
that the annotations have priorities according to which
they are composed, this approach seems to be a better
choice since it does not omit important information. But,
on the other hand, let us consider the situation depicted
in Figure 3, where for i 2 f1; 2; 3g sim(f 0; fi) &gt; Tsim
and f 0 is the annotated fragment.
The problem is whether we can annotate all the three
fragments f1; f2; f3 using the annotation of f 0,
especially what will be the result of intersection in case of
f1 and f3 or f2 and f3, i.e. fragments occurring on
the same root path5. We can naturally assume that
intersection of two identical annotations is overriding and
as such has no effect. Thus we could annotate only the
topmost fragment on each root path. In case of example
in Figure 3 this rule would be applied twice, resulting
in a single annotation of fragment f3. But what if we
knew, in addition, that sim(f 0; f1) &gt; sim(f 0; f3) and
sim(f 0; f2) &gt; sim(f 0; f3)? As it is obvious, in such
case it is seems to be more reasonable and natural to
annotate fragments f1 and f2 rather than whole f3. Or this
5A path from the root node to a leaf node.
situation can be again a case for user intervention,
depending on the point of view of it. We will further
consider the former one.</p>
          <p>If we generalize the idea, the algorithm annotates an
element e using annotations of all fragments f 0 2 F 0
s.t. sim(f 0; fe) &gt; Tsim and 6 9 element e0 on any root
path traversing e s.t. sim(f 0; fe0 ) &gt; sim(f 0; fe). The
resulting algorithm, so-called multiple annotation
strategy (MAS), is depicted by Algorithm 2, where e:ancs
denotes a set of (direct or undirect) ancestors of element
e and e:descs denotes a set of (direct or undirect)
descendants of e. The process of construction of the similarity
matrix remains the same as in case of Algorithm 1.
Algorithm 2 Multiple Annotation Strategy (MAS)
Input: S0, F 0, sim(fx; fy ), Tsim
Output: F 00, i.e. F 0 [ newly annotated fragments
1: F 00 Ã F 0</p>
          <p>fconstruction of the similarity matrixg
2:
-//</p>
          <p>fannotation strategyg
3: for all f 0 2 F 0 do
4: for all element e 2 S0 do
5: if (sim[f 0; fe] &gt; Tsim) ^
(6 9ea 2 e:ancs : sim[f 0; fea ] &gt; sim[f 0; fe]) ^
(6 9ed 2 e:descs : sim[f 0; fed ] &gt; sim[f 0; fe])
then
6: fe.annotation Ã f 0.annotation
7: F 00 Ã F 00 [ ffeg
8: end if
9: end for
10: end for
11: return F 00</p>
          <p>Using this approach we should consider what will
happen in case a user annotates two structurally
identical (or too similar) fragments using different annotations.
We cannot simply rely on predefined type of their
intersection and corresponding priorities, because the
situation is a slightly different one. In this case the system
should rather ask for user intervention whenever it is not
able to decide. And this is again a problem of the
particular implementation.
4.1.4</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Similarity Measure and Optimization of the</title>
      </sec>
      <sec id="sec-4-6">
        <title>Search Algorithm</title>
        <p>Now let us consider the search strategy from the point
of view of complexity of the algorithm. Figure 4 depicts
an example of processing a single annotated fragment, in
particular the amount of similarity comparisons.
Annotated fragments f and g are highlighted using rectangles,
all schema fragments, which are compared with f are
highlighted using dotted ovals.</p>
        <p>If we do not know any features of the measure, there
are not many ways how to avoid the exhaustive search.
Also the order in which fragments in F 0 are processed
is then unimportant. But although we can assume that
card(F 0) = m is small, i.e. that a user annotates
several fragments but the number is not large, the
exhaustive search can be expensive due to the size of GeSx. And
even from the simple example in Figure 4 it is obvious
that there are pairs of schema fragments which do not
have to be compared at all. Another problem is the
complexity of the if condition of Algorithm 2 (line 5) which
can in the worst case lead to multiple searching through
the whole GeSx. So in both the cases we need to avoid the
unnecessary similarity evaluations.</p>
        <p>
          It seems promising to borrow the idea of
clustering, similarly to paper [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], where the distance between
schema fragments is determined by their mutual
similarity, e.g. dist(fx; fy ) = 1 ¡ sim(fx; fy ). An example is
depicted in Figure 5 for the sample schema in Figure 4.
        </p>
        <p>All schema fragments (depicted using black-filled
circles) are divided into clusters C1, C2,..., Ck (depicted
using black circular lines) having their centroids c1, c2,...,
ck and radii r1, r2,..., rk (or one common radius r,
depending on the implementation). Having this
information, only those schema fragments have to be compared
with fragment f , whose clusters intersect the cluster with
centroid f and radius Tsim. In case of Figure 5 these are
clusters C1 and C2. Obviously, if the clusters were
selected appropriately, the amount of comparisons would
decrease rapidly. Hence the key concern of all clustering
algorithms is mainly the construction of the clusters.</p>
        <p>
          The construction is usually performed using a
kmeans algorithm or its variations (e.g. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]), where the
initial clusters are selected randomly and then iteratively
improved. In the i-th iteration each fragment is
compared with centroids of all clusters and assigned to the
closest one. The algorithm terminates if none of the
clusters changes, otherwise new centroids are computed and
(i + 1)-th iteration follows. The complexity of the
construction is O(I ¢ j©j ¢ k), where I is the number of
iterations. In case of complexity of similarity evaluation
the worst case is when either k = 1 or all k clusters
mutually intersect, i.e. when we cannot avoid any of
the similarity comparisons. Hence in the worst case the
number of comparisons is the same as in the exhaustive
search strategy and the complexity can worsen only the
pre-processing, i.e. the construction of clusters. And this
is the step we want to remove too.
        </p>
        <p>
          For further optimization we can exploit
characteristics of the chosen similarity measure. The existing
algorithms for measuring similarity on schema level usually
exploit various supplemental matchers [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], i.e.
functions which evaluate similarity of a particular feature of
the given schema fragments, such as, e.g., similarity of
number and types of leaf nodes, similarity of root
element names, similarity of context, etc.
        </p>
        <p>Definition 9 A matcher is a function m : ©2 ! [0; 1]
which evaluates similarity of a particular feature of two
schema fragments fx; fy 2 ©.</p>
        <p>Definition 10 A partial similarity measure is a function
mpart : ©2 ! [0; 1]p which evaluates similarity of
the given schema fragments fx; fy 2 © using matchers
m1; m2; :::; mp : ©2 ! [0; 1] and returns a p-tuple of
their results.</p>
        <p>Then the partial results are combined using an
appropriate approach, usually a kind of a weighted sum, into
the resulting composite similarity value.</p>
        <p>Definition 11 A composite similarity measure is a
function mcomp : [0; 1]p ! [0; 1] which combines the results
of particular matchers and returns the total similarity
value.</p>
        <p>Due to the features of selected partial matchers the
existing techniques usually exploit a bottom-up strategy,
i.e. starting from leaf nodes towards the root node.
Together with the previously mentioned problem of
similar intersecting fragments this is why we need to know
the behavior of the similarity measure on particular root
paths.</p>
        <p>For instance, if we knew that the similarity measure
is concave, i.e. that it has only one global maximum,
we could skip processing of all the ancestors on the
current root path whenever we reach the fragment with
the extreme value. A sample situation can be seen in
Figure 6 which depicts an example of a graph of
similarity function for an annotated fragment f 0 and
fragments f1; f2; :::; fr on a single root path. From the
graph we can see, that only fragments f1; f2; f3; f4 need
to be processed (f4 for testing the extremity), then the
similarity evaluation can terminate, skipping fragments
f5; f6; :::; fr.</p>
        <p>As it is obvious, this way we can decrease the
number of unnecessary similarity evaluations as well as avoid
pre-processing of the schema and expensive checking of
the if condition of Algorithm 2. Naturally, the efficiency
of such approach depends strongly on the position of the
extreme on the root path. The key problem is how to
define such similarity measure. For our purpose we need
a measure which focuses especially on the structure of
the compared fragments, known equivalences or
relations between regular expressions, differences between
simple and complex types, etc., i.e. on features that
influence the efficiency of database processing the most.
But it is hard, if not impossible, to propose a measure
with concave behavior which is at the same time enough
precise in relation to these requests. Nevertheless, we
can exploit a relaxed version of this idea as a kind of
heuristic of the bottom-up strategy.</p>
        <p>Although we can hardly ensure that mcomp is
concave, we can assume that at least q of the matchers, where
1 · q · p, have this property. Without loss of generality
we suppose that these are m1; m2; :::; mq. For instance a
trivial matcher with such behavior can compare the
number of distinct element or attribute names, the amount of
similar operators, the depth of the corresponding
regular expression, etc. The heuristic is then based on the
idea that if at least “sufficient amount” of the q matchers
exceed their extreme value, we can terminate processing
of the current root path too. There are just two
differences from the previously mentioned idea. Firstly, the
exceeding of the particular extreme is expressed using
a threshold Tex 2 [0; 1] which guarantees that
processing of the current root path does not terminate neither
too soon (to reach the optimum of the composite
similarity measure) nor too late (to avoid unnecessary
similarity evaluations). Secondly, as it is obvious, the matchers
themselves are not precise in terms of similarity
evaluation. Hence each of them is assigned a user-specified
reliability r1; r2; :::; rq 2 [0; 1], where 0 expresses strong
unreliability and 1 strong reliability of the matcher. The
reliabilities then influence the real value of threshold Tex
for particular matchers multiplying its value.</p>
        <p>The whole optimization of the approach, so-called
basic annotation strategy (BAS), is depicted by
Algorithm 3, where function terminate returns true if the
search algorithm should terminate in the given node,
otherwise it returns false. Furthermore, we assume that each
element of the graph is assigned an auxiliary list of
candidates consisting of pairs &lt; f ragment; similarity &gt;,
i.e. references to fragments (and corresponding
similarity values) within its subtree that are candidates for
annotation.</p>
        <p>The algorithm processes schema graph starting from
leaf nodes. For each root path the optimal similarity
value and the reference to corresponding fragment are
propagated until a better candidate is found or the
condition of the heuristic is fulfilled. Then the processing
of the current root path is terminated and current
candidates are annotated. The complexity of the algorithm
depends on the heuristics. In the worst case it does not
enable to skip processing of any node that results in the
exhaustive search. But in contrast to the clustering
approach we have avoided the expensive preprocessing of
the algorithm.</p>
        <p>In general we could use an arbitrary similarity
measure, not exactly the above defined composite one. It
is also possible to use disjoint sets of matchers for the
heuristic and for the composite similarity measure.
Nevertheless, we will deal with the above described ones,
since it is the typical and verified way for evaluating
similarity among XML schemes.
4.1.5</p>
      </sec>
      <sec id="sec-4-7">
        <title>Recursive Elements</title>
        <p>
          Last but not least, we have to solve the open problem of
expanded recursive elements, since the expansion is not
a lossless operation as in case of shared elements. As
we have already indicated, we will exploit the results of
analysis of real-world XML data which shows two
important aspects [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]:
1. Despite it is generally believed that recursive
elements are of marginal importance, they are used in
a significant portion of real XML data.
2. Although the recursive elements can have arbitrarily
complex structure, the most common type of
recursion is linear and the average depth of recursion is
low.
        </p>
        <p>If we realize that we need the “lost” information about
recursion only at one stage of the algorithm, the solution
is quite obvious. We analyze the structure of schema
fragments when evaluating matchers m1; m2; :::; mp,
whereas each of the matchers describes similarity of a
particular feature of the given fragments. In case the
fragments contain recursive elements we will not use the
exact measure, but its approximation with regard to the
real complexity of recursive elements. For instance if
the matcher analyzes the maximum depth of fragment
containing a recursive element, the resulting depth is not
infinite, but considers the average depth of real-world
recursive elements.</p>
        <p>The question is whether it is necessary to involve
a matcher which analyzes the amount of recursive
elements in schema fragments. On one hand it can increase
the precision of the composite measure. But from
another point of view the approximation transforms the
recursive element to a “classical” element and hence such
matcher can be misleading.
4.2</p>
      </sec>
      <sec id="sec-4-8">
        <title>Adaptive Mapping Strategy</title>
        <p>At this stage of the algorithm we have a schema S0 and
a set of annotated fragments F 0 which involve the
userdefined fragments and fragments identified by Algorithm
3. As the second enhancing we want to apply an adaptive
mapping strategy to the remaining parts of the schema.</p>
        <p>As mentioned previously, the key idea of adaptive
strategies is to adapt the target relational schema R to
the expected future application, which is specified by
a sample set of XML data and/or XML queries. The
techniques define a set of XML-to-XML transformations
which produce a space of equivalent or more general
XML schemes. A search algorithm is used to find a
schema, where the evaluation of the given queries is most
efficient. Thus at first glance the user-driven techniques
have nothing in common with the adaptive ones. Or, we
could expect that the user provides not only a set of
annotations, but also sample XML documents and XML
queries. Then we could use a classical adaptive strategy
for not annotated parts of schema S0. Such
combination should not be difficult considering the fact that
userdriven techniques enable to store various schema
fragments in various ways. Nevertheless, the key
shortcoming of such approach is that the user is expected to
provide too many information.</p>
        <p>Under a closer investigation we can see that the
usergiven annotations provide a similar information – they
say how particular schema fragments should be stored
to enable efficient data querying and processing. Thus
we can simply reuse the user-given information. For this
purpose we define an operation contraction which
enables to omit those schema fragments, where we already
know the storage strategy, and focus on the remaining
ones.</p>
        <p>Definition 12 A contraction of an expanded schema
graph GeSx0 with annotated fragment set F 0 is an
operation which replaces each fragment f 0 2 F 0 with a single
auxiliary node called a contracted node.</p>
        <p>The resulting graph is called a contracted graph GcSo0n.</p>
        <p>From Definitions 4 and 12 we can easily prove the
following simple statement which enables to reuse the
search algorithm on the contracted graph.</p>
        <p>Lemma 3 Contracted graph GcSo0n of a connected
expanded schema graph GeSx0 is connected.</p>
        <p>The basic idea of the adaptive strategy is as follows:
Having a contracted graph GcSo0n we repeat the search
algorithm (in particular its slight modification) and
operation contraction until there can be found any fragment
to annotate. Contrary to the previous situation the search
algorithm has the following differences:
² It searches for schema fragments which are not
involved in the schema, i.e. it searches among all
nodes of the given (contracted) graph and returns
the (eventually empty) set of found fragments.
² For similarity evaluation we do not take into
account contracted nodes, i.e. neither their current,
nor their previous structure or any other features.
These were already analyzed and processed in
previous steps of the algorithm.
² The annotations of contracted nodes are always
overriding in relation to the newly defined ones.
² The required threshold is more precise, i.e. we use
a threshold Tcon &lt; Tsim.</p>
        <p>We denote this modification of BAS as a
contractionaware annotation strategy (CAS). (We omit its formal
description for obviousness and the paper length.) The
resulting annotating strategy, so called global annotation
strategy (GAS), is depicted by Algorithm 4, where
function contract applies operation contraction on expanded
graph of the given schema and corresponding fragments,</p>
        <sec id="sec-4-8-1">
          <title>Algorithm 3 Basic Annotation Strategy (BAS)</title>
          <p>and function expand contractions expands all the
contracted nodes of the given schema to the original ones.</p>
          <p>The resulting complexity of the algorithm depends on
the number of iterations of the cycle (lines 3 – 7). In the
worst case each iteration results in annotating of a single
element, i.e. the search algorithm repeats (j©j ¡ jF 0j + 1)
times.</p>
          <p>Considering the whole approach, we are especially
interested in the efficiency of the resulting
XML-torelational storage strategy which can be verified only
through appropriate tests on real XML data. We discuss
it in the following section.
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Open Issues</title>
      <p>In the previous section we have described and discussed
the proposed algorithm on theoretical level. We have
mentioned several possible solutions and discussed their
consequences and disadvantages as well as reasons for
the choices we have made. But despite the detailed
description there are still several open issues. We
distinguish two main categories:
1. Features of the particular implementation and</p>
      <sec id="sec-5-1">
        <title>2. Behavior of the algorithm on real XML data.</title>
        <p>As for the former case the key implementation
decisions are especially:
² the set of supported schema annotations, types of
their mutual intersection (or forbiddance), and their
priorities,
² the matchers m1, m2, :::, mq, mq+1, :::, mp and
corresponding reliabilities r1, r2, :::, rq,
² the composite similarity measure mcomp, and
² the thresholds Tsim, Tex, and Tcon.</p>
        <p>The key problem lies especially in tuning of
reliabilities and thresholds, since both influence the precision and
efficiency of the system strongly. The remaining
characteristics are related rather to its usefulness and versatility.
There are also marginal questions such as, e.g., whether
the system will support final elements or user
intervention if there are more candidates for a particular
situation. But these features do not have the key influence on
the proposed approach itself.</p>
        <p>
          The latter category of open issues is quite
unpredictable, despite the existing statistics of real XML data
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. It is caused mainly by two facts. Firstly, although
we know the usual characteristics of the real data, we
cannot predict especially the behavior of more complex
similarity measures due to the above mentioned tuning
of the characteristics. And secondly, we cannot predict
the behavior of the proposed adaptive strategy, since we
have no information about the structure of contracted
graphs of real data. Furthermore, the choice of
particular schema fragments will be strongly related to the type
of the tested data and thus the efficiency of the resulting
storage strategy can vary remarkably.
        </p>
        <p>As it is obvious, both the categories are also related
significantly. And though some particular features can be
estimated or preset according to the existing user-driven
systems and statistical analysis of data, most of them will
still require a series of experimental tests.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>The main aim of this paper was to illustrate that since the
idea of database-based XML processing methods is still
up-to-date, the techniques should and can be further
enhanced. On this account we have proposed a user-driven
mapping algorithm which is able to exploit the user given
information, i.e. schema annotations, more deeply and,
at the same time, to find the mapping strategy for the
not annotated parts more efficiently – using an adaptive
approach. We have described and discussed the
algorithm on theoretical level as well as summed up the
corresponding open issues related to particular
implementation decisions.</p>
      <p>
        A natural next step of our work is the implementation
of the proposed algorithm and especially experimental
tests of its behavior on real data. For this purpose we
are enhancing our previous implementation of a fixed
XML-to-relational mapping strategy [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which exploits
object-oriented features of XML Schema language
together with features of object-relational database
systems. For the testing we will exploit information about
categories of real XML data and their typical features
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as well as existing related works which could help
with tuning the system.
      </p>
      <p>A possible further enhancing of our approach can be
found in focussing on statistically frequent XML schema
patterns. They can be used in several ways – e.g. as
a reasonable default setting or as XML patterns with a
higher priority.</p>
      <p>Another improvement can be an exploitation of the
semantic of element and/or attribute names. The similarity
can be searched not only on structural level, but using a
kind of thesaurus or appropriate user-given information.
Although our proposal focuses mainly on structural
similarities related to efficiency of database processing, it is
at least worth testing whether the semantic of the names
carries additional important information useful for this
purpose too.</p>
      <p>Next interesting task could be also a combination of
our approach with a real cost-driven one, i.e. an
exploitation of both user-given annotations and a sample set of
XML data and XML queries together. As we have
already mentioned, the key disadvantage is in the amount
of required input data. But, on the other hand, the
combination of the two approaches could bring interesting
results or the efficiency of the two approaches can be at
least compared.</p>
      <p>
        And last but not least, the key enhancing lies in
dynamic adaptability of the system [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This challenging
but non-trivial task would solve the remaining
disadvantage of the adaptive methods – the fact that the schema
is adapted only once, at the beginning but not in case the
application changes.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[1] XHTML 1</source>
          .0 The
          <string-name>
            <surname>Extensible HyperText Markup Language (Second Edition). W3C Recommendation</surname>
          </string-name>
          ,
          <year>August 2002</year>
          . http://www.w3.org/TR/ xhtml1/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          .
          <article-title>Storage Techniques and Mapping Schemas for XML</article-title>
          .
          <source>Technical Report TD-5P4L7B</source>
          , AT&amp;T Labs-Research,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Du</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <article-title>A Comprehensive Solution to the XML-to-Relational Mapping Problem</article-title>
          .
          <source>In WIDM'04: Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>38</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          .
          <article-title>Overview of Existing XML Storage Techniques</article-title>
          . AT&amp;
          <string-name>
            <surname>T LabsResearch</surname>
          </string-name>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Balmin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          .
          <article-title>Storing and Querying XML Data Using Denormalized Relational Databases</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>30</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bertino</surname>
          </string-name>
          , G. Guerrini, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mesiti</surname>
          </string-name>
          .
          <article-title>A Matching Algorithm for Measuring the Structural Similarity between an XML Document and a DTD and its Applications</article-title>
          . Inf. Syst.,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Biron</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          .
          <source>XML Schema Part</source>
          <volume>2</volume>
          :
          <string-name>
            <surname>Datatypes (Second Edition). W3C Recommendation</surname>
          </string-name>
          ,
          <year>October 2004</year>
          . www.w3.org/TR/ xmlschema-2/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bohannon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Roy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Simeon</surname>
          </string-name>
          .
          <article-title>From XML Schema to Relations: A Cost-based Approach to XML Storage</article-title>
          .
          <source>In ICDE '02: Proceedings of the 18th International Conference on Data Engineering</source>
          , pages
          <fpage>64</fpage>
          -
          <lpage>75</lpage>
          , Washington, DC, USA,
          <year>2002</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Paoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Sperberg-McQueen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Maler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Yergeau. Extensible Markup</surname>
          </string-name>
          <article-title>Language (XML) 1.0 (Fourth Edition)</article-title>
          .
          <source>W3C Recommendation</source>
          ,
          <year>September 2006</year>
          . http://www.w3. org/TR/REC-xml/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Do</surname>
          </string-name>
          and
          <string-name>
            <surname>E. Rahm.</surname>
          </string-name>
          <article-title>COMA - A System for Flexible Combination of Schema Matching Approaches</article-title>
          .
          <source>In VLDB'02: Proceedings of the 28th International Conference on Very Large Data Bases</source>
          , pages
          <fpage>610</fpage>
          -
          <lpage>621</lpage>
          ,
          <string-name>
            <surname>Hong</surname>
            <given-names>Kong</given-names>
          </string-name>
          , China,
          <year>2002</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          .
          <article-title>Storing and Querying XML Data using an RDMBS. IEEE Data Eng</article-title>
          . Bull.,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          and H. Meyer. XML and
          <string-name>
            <surname>ObjectRelational Database</surname>
          </string-name>
          Systems - Enhancing
          <source>Structural Mappings Based on Statistics. In Selected papers from the Third International Workshop WebDB 2000 on The World Wide Web and Databases</source>
          , pages
          <fpage>151</fpage>
          -
          <lpage>170</lpage>
          , London, UK,
          <year>2001</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Generic Schema Matching with Cupid</article-title>
          .
          <source>In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          , San Francisco, CA, USA,
          <year>2001</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mlynkova</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorny</surname>
          </string-name>
          .
          <article-title>From XML Schema to Object-Relational Database - an XML Schema-Driven Mapping Algorithm</article-title>
          .
          <source>In ICWI'04: Proceedings of IADIS International Conference WWW/Internet</source>
          , pages
          <fpage>115</fpage>
          -
          <lpage>122</lpage>
          , Madrid, Spain,
          <year>2004</year>
          . IADIS.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mlynkova</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorny</surname>
          </string-name>
          .
          <article-title>Adaptability of Methods for Processing XML Data using Relational Databases - the State of the Art and Open Problems</article-title>
          .
          <source>In RCIS'07: Proceedings of the 1st International Conference on Research Challenges in Information Science</source>
          (to appear), Ouarzazate, Morocco,
          <year>April 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mlynkova</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorny</surname>
          </string-name>
          .
          <article-title>Exploitation of Similarity and Pattern Matching in XML Technologies</article-title>
          .
          <source>Technical report 2006/13</source>
          . Charles University, Prague, Czech Republic,
          <year>November 2006</year>
          . http://kocour.ms.mff.cuni.cz/ ~mlynkova/doc/tr2006-
          <fpage>13</fpage>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mlynkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorny</surname>
          </string-name>
          .
          <article-title>Statistical Analysis of Real XML Data Collections</article-title>
          .
          <source>In COMAD'06: Proceedings of the 13th International Conference on Management of Data</source>
          , pages
          <fpage>20</fpage>
          -
          <lpage>31</lpage>
          , New Delhi, India,
          <year>2006</year>
          . Tata
          <string-name>
            <surname>McGraw-Hill Publishing</surname>
          </string-name>
          Company Limited.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>P. K.L. Ng</surname>
            and
            <given-names>V. T.Y.</given-names>
          </string-name>
          <string-name>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Structural Similarity between XML Documents and DTDs</article-title>
          .
          <source>In ICCS'03: Proceedings of the International Conference on Computational Science</source>
          , pages
          <fpage>412</fpage>
          -
          <lpage>421</lpage>
          . Springer Berlin / Heidelberg,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nierman</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Evaluating Structural Similarity in XML Documents</article-title>
          . In
          <source>WebDB'02: Proceedings of the Fifth International Workshop on the Web and Databases</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>66</lpage>
          , Madison, Wisconsin, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>A Survey of Approaches to Automatic Schema Matching</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>10</volume>
          (
          <issue>4</issue>
          ):
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shanmugasundaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tufte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. He,
          <string-name>
            <surname>D. J. DeWitt</surname>
            , and
            <given-names>J. F.</given-names>
          </string-name>
          <string-name>
            <surname>Naughton</surname>
          </string-name>
          .
          <article-title>Relational Databases for Querying XML Documents: Limitations and Opportunities</article-title>
          .
          <source>In VLDB'99: Proceedings of 25th International Conference on Very Large Data Bases</source>
          , pages
          <fpage>302</fpage>
          -
          <lpage>314</lpage>
          , San Francisco, CA, USA,
          <year>1999</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Smiljanic</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Keulen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Jonker</surname>
          </string-name>
          .
          <article-title>Using Element Clustering to Increase the Efficiency of XML Schema Matching</article-title>
          .
          <source>In ICDEW'06: Proceedings of the 22nd International Conference on Data Engineering Workshops</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>54</lpage>
          , Los Alamitos, CA, USA,
          <year>2006</year>
          . IEEE Computer Society Press.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Beech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maloney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Mendelsohn</surname>
          </string-name>
          .
          <source>XML Schema Part</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Structures (Second Edition). W3C Recommendation</surname>
          </string-name>
          ,
          <year>October 2004</year>
          . www.w3.org/TR/xmlschema-1/.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiao-ling</surname>
          </string-name>
          , L.
          <article-title>Jin-feng, and D. Yi-sheng. An Adaptable and Adjustable Mapping from XML Data to Tables in RDB</article-title>
          .
          <source>In Proceedings of the VLDB'02 Workshop EEXTT and CAiSE 2002 Workshop DTWeb</source>
          , pages
          <fpage>117</fpage>
          -
          <lpage>130</lpage>
          , London, UK,
          <year>2003</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Similarity Metric for XML Documents</article-title>
          .
          <source>In FGWM03: Proceedings of Workshop on Knowledge and Experience Management</source>
          , Karlsruhe, Germany,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Cost-Driven Storage Schema Selection for XML</article-title>
          .
          <source>In DASFAA'03: Proceedings of the 8th International Conference on Database Systems for Advanced Applications</source>
          , pages
          <fpage>337</fpage>
          -
          <lpage>344</lpage>
          , Kyoto, Japan,
          <year>2003</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>