=Paper= {{Paper |id=Vol-256/paper-4 |storemode=property |title=An XML-to-Relational User-Driven Mapping Strategy Based on Similarity and Adaptivity |pdfUrl=https://ceur-ws.org/Vol-256/submission_19.pdf |volume=Vol-256 |dblpUrl=https://dblp.org/rec/conf/syrcodis/Mlynkova07 }} ==An XML-to-Relational User-Driven Mapping Strategy Based on Similarity and Adaptivity== https://ceur-ws.org/Vol-256/submission_19.pdf
    An XML-to-Relational User-Driven Mapping Strategy Based
                on Similarity and Adaptivity ∗

                                                       c Irena Mlynkova
                                                       °
                                                        Charles University
                                               Faculty of Mathematics and Physics
                                               Department of Software Engineering
                                                      Malostranske nam. 25
                                                118 00 Prague 1, Czech Republic
                                                irena.mlynkova@mff.cuni.cz



                          Abstract                                   as efficient as the native ones due to the key problem of
                                                                     structural differences between XML data and relations.
     As XML has become a standard for data repre-                    The reason for the popularity is that relational databases
     sentation, it is inevitable to propose and imple-               are still regarded as universal and powerful data process-
     ment techniques for efficient managing of XML                   ing tools and their long theoretical and practical history
     data. A natural alternative is to exploit features              can guarantee a reasonable level of reliability and effi-
     and functions of (object-)relational database                   ciency. Contrary to native methods it is not necessary
     systems, i.e. to rely on their long theoreti-                   to start “from scratch” but we can rely on a mature and
     cal and practical history. The main concern of                  verified technology, i.e. properties that no native XML
     such techniques is the choice of an appropriate                 database can offer yet. On this account we believe that
     XML-to-relational mapping strategy.                             these methods and their possible improvements should
     In this paper we focus on enhancing of user-                    be further enhanced.
     driven techniques which leave the mapping de-                      The main concern of the database-based1 XML tech-
     cisions in hands of users. We propose an al-                    niques is the choice of an appropriate XML-to-relational
     gorithm which exploits the user-given annota-                   mapping strategy, i.e. the way the given XML data are
     tions more deeply searching the user-specified                  stored into relations. We can distinguish the following
     “hints” in the rest of the schema and applies an                three types approaches [4] [15]:
     adaptive method on the remaining schema frag-
     ments. We describe the algorithm theoretically,                    • fixed mapping methods based on predefined set of
     discussing the key ideas of the approach, cho-                       mapping rules and appropriate heuristics (e.g. [11]
     sen solutions, their reasons, and consequences.                      [21]),
     Finally, we overview the open issues related to
     implementation of the proposed algorithm and                       • adaptive methods which adapt the target database
     its experimental testing on real XML data.                           schema to the intended application (e.g. [12] [8]),
                                                                          and
1    Introduction                                                       • methods which leave the mapping decisions in
Without any doubt the eXtensible Markup Language                          hands of users (e.g. [3] [5]).
(XML) [9] is currently de-facto a standard for data rep-
resentation and manipulation. Its popularity is given by                 The first set of methods can be further divided [4]
the fact that the basic W3C recommendations are well-                into generic and schema-driven ones depending on omit-
defined, easy-to-learn, and at the same time still enough            ting or exploiting the existence of corresponding XML
powerful. The popularity naturally invoked a boom of                 schema. However, both the types use a straightforward
their efficient implementations based on various storage             mapping strategy regardless the intended future usage.
strategies from the traditional ones such as file systems            On the contrary, adaptive methods automatically adapt
to brand-new ones proposed particularly for XML struc-               the database schema of a fixed method to the given ad-
tures, so-called native approaches or directly native XML            ditional information. The best known representatives,
databases.                                                           so-called cost-driven methods, usually search a space
   Probably the most natural and practically used ap-                of possible XML-to-relational mappings and choose the
proach involves techniques which exploit features of                 one which conforms to the required application, speci-
(object-)relational database systems, though they are not            fied using a sample set of XML data and XML queries,
                                                                     the most, i.e. where the provided queries over the given
    ∗ This work was supported in part by the National Programme of
                                                                     data can be evaluated most efficiently. Finally, the last
Research (Information Society Project number 1ET100300419).
Proceedings of the Spring Young Researcher’s Colloquium                  1 In the rest of the paper the term “database” represents an (object-)
on Database and Information Systems, Moscow, Russia, 2007            relational database.
mentioned type of methods can be also divided into two        2   Motivation
groups. We distinguish so-called user-defined and user-
                                                              The key concern of this paper is to exploit the user-given
driven methods [15] which differentiate in the amount of
                                                              information as much as possible. We result from the idea
necessary user interaction. In the former case a user is
                                                              of user-driven enhancing of the user-defined techniques,
expected to define both the target database schema and
                                                              where a user is expected to help the mapping process, not
the required mapping strategy, i.e. to do all the work
                                                              to perform it. We want to go even farther. But first of all
“manually”. In the latter case a default mapping strategy
                                                              we discuss why user-given information is so important to
is defined but a user can specify local mapping changes
                                                              deal with.
from the predefined set of other allowed mapping strate-
                                                                  A simple demonstrative example can be a set of XML
gies, usually using schema annotations. In other words,
                                                              documents which contain various XHTML [1] frag-
the user-driven approach solves the main disadvantage of
                                                              ments. A classical fixed schema-driven mapping strat-
the user-defined one – the requirement of a user skilled
                                                              egy (e.g. [21] [14]) would decompose the fragments into
in two complex technologies who is, in addition, able to
                                                              a number of relations. Since we know that the standard
specify an optimal database schema for a particular ap-
                                                              XHTML DTD allows, e.g., complete subgraphs on up to
plication. Note that the user-driven techniques can be
                                                              10 nodes, the reconstruction of such fragments would be
regarded as a type of adaptive methods too [15], since
                                                              a really expensive operation in terms of the number of
they also adapt the default target schema to additional
                                                              join operations. But if we knew that the real complexity
information, in particular to user-specified requirements.
                                                              of such fragments is much simpler (and the analysis of
    In this paper we focus on further enhancing of user-
                                                              real XML data shows that it is quite probable [17]), e.g.
driven techniques, particularly on their (in our opinion)
                                                              that each of the fragments can be described as a simple
two main persisting disadvantages. The first one is the
                                                              text with tags having the depth of 2 at most, we could
fact that the default mapping strategy is (to our knowl-
                                                              choose a much simpler storage strategy including the ex-
edge) always a fixed one. It is quite a surprising finding
                                                              treme one – a CLOB column.
since we know that the proposed systems are able to store
                                                                  Another example can be the crucial feature of
schema fragments in various ways. From this point of
                                                              database storage strategies – updatability of the data. On
view an adaptive enhancing of the fixed method seems
                                                              one hand, we could know that the data will not be up-
to be quite natural and suitable. The second key short-
                                                              dated too much or at all but we need an effective query
coming we are dealing with is weak exploitation of the
                                                              evaluation. On the other hand, there could be a strong
user-given information. We believe that the schema an-
                                                              demand for effective data updates, whereas the queries
notations a user provides can not only be directly applied
                                                              are of marginal importance. And there are of course
on particular schema fragments, but the information they
                                                              cases which require effective processing of both. Nat-
carry can be further exploited. The main idea is quite
                                                              urally, the appropriate storage strategies differ strongly.
simple – we regard the annotations as “hints” how a user
                                                              In case of effective query processing a number of indices
wants to store particular XML patterns and we use this
                                                              and numbering schemes can be exploited but at the cost
information twice again. Firstly, we search for similar
                                                              of corresponding expensive updates. Effective updates,
patterns in the rest of the schema and store the found
                                                              conversely, require the simplest information of mutual
fragments in a similar way. And secondly, we exploit
                                                              data relationships. And if both the aspects are required,
the information in the adaptive strategy for not annotated
                                                              it is unavoidable to compromise. And such decision can
parts of the schema.
                                                              be again made correctly only if we have an appropriate
    To sum up, the main contribution of this paper is a
                                                              information on the required future usage.
proposal of an algorithm which enhances classical user-
                                                                  Last but not least, let us consider the question of data
driven strategies using the following two approaches:
                                                              redundancy. Without any additional information the op-
  • a deeper exploitation of the information carried in       timal storage strategy is so-called 4NF schema decompo-
    user-given schema annotations and                         sition into relations [5], where 4NF stands for the fourth
                                                              normal form, which can be achieved, e.g., using the clas-
  • an adaptive mapping strategy for not annotated parts      sical Hybrid algorithm [21], a representative of fixed
    of the schema.                                            mapping methods. The decomposition does not involve
                                                              data redundancy or violation of any normal form, i.e. it
    We describe the proposed algorithm theoretically. We      results in a database schema with the lowest number of
discuss the key ideas and problems, their possible solu-      relations and null attributes. But, similarly to database
tions, reasons for our particular decisions, and their con-   design, there can be reasonable real-world cases when
sequences. Finally, we overview the related problems          the data should not strictly follow the rules of normal
and open issues of a particular implementation of the         forms and their moderation can lead to more effective
proposal.                                                     query processing.
    The rest of the paper is structured as follows: Section       Both the cost-driven and user-driven methods are
2 contains a motivation for focusing on user-given infor-     based on the idea of exploiting additional user-given
mation. Section 3 overviews the existing related works in     information and they appropriately adapt the target
the area of user-driven methods, adaptive methods, and        database schema. In the former case it is extracted from
similarity of XML data. In the fourth section we describe     a sample set of XML documents and/or XML queries
and discuss the proposed algorithm in more detail and in      which characterize the typical future usage, in the lat-
the fifth section we sum up the corresponding implemen-       ter case it is specified by user-given annotations, i.e. the
tation open issues. Finally, Section 6 provides conclu-       user directly specifies the required changes of the default
sions and outlines our future work.                           mapping. But although there is a plenty of existing repre-
sentatives of the two approaches, there are still numerous     and semi-structured parts. For this purpose a sample set
weak points and open issues that should be improved and        of XML documents and XML queries is used.
solved [15].                                                       The other existing cost-driven approaches [8] [24]
   As mentioned above, our first improvement is search-        [26] use a different strategy. They define a set of XML-
ing for identical or similar fragments in the not annotated    to-XML transformations (e.g. inlining / outlining of an
schema parts. This approach has two main advantages:           element / attribute, splitting / merging of a shared ele-
                                                               ment2 , associativity, commutativity, etc.), a fixed XML-
    1. The user is not forced to annotate all schema frag-     to-relational mapping, and a cost function which eval-
       ments that have to be stored alternatively, but only    uates a relational schema against a given sample set of
       those with different structure. Thus the system is      XML data and/or queries. Using a search algorithm a
       not endangered of unintended omitting of annotat-       space of possible relational schemes is searched and the
       ing all similar cases.                                  optimal one is selected. Since it can be proven that even
                                                               a simple set of transformations causes the problem to
    2. The system can reveal structural similarities which     be NP-hard, the corresponding search algorithms in fact
       are not evident “at first glance” and which could re-   search for suboptimal solutions and exploit, e.g., heuris-
       main hidden to the user.                                tics, terminal conditions, approximations, etc.

   Thus the first main concern of our proposal is how to       3.2   Similarity of XML Data
identify identical or similar fragments within the schema.
   Our second enhancing focuses on the choice of the           Exploitation of similarity of XML data can be found in
mapping strategy for schema fragments which were nei-          various XML technologies, such as, e.g., document val-
ther annotated by the user, nor identified as fragments        idation, query processing, data transformation, storage
similar to the annotated ones. In this case we combine         strategies based on clustering, data integration systems,
the idea of cost-driven methods with the fact that a user-     dissemination-based applications, etc. Consequently, the
driven technique should support various storage strate-        number of existing works is enormous. We can search
gies too. Hence our second concern is how to find the          for similarity among XML documents, XML schemes,
optimal mapping strategy for the remaining schema frag-        or between the two groups. Furthermore, we can dis-
ments and, in addition, with exploitation of the informa-      tinguish several levels of similarity that can be taken into
tion we already have, i.e. the user-specified annotations,     account during the search process – a structural level (i.e.
as much as possible.                                           considering only the structure of the given XML frag-
                                                               ments), a semantic level (i.e. taking into account also
                                                               the meaning of element / attribute names), a constraint
3     Related Work
                                                               level (i.e. taking into account also various text value con-
As we have mentioned, methods which involve a user in          straints), etc.
the mapping process can be divided into user-defined and           In case of document similarity we distinguish tech-
user-driven. Probably due to simple implementation the         niques expressing the similarity of two documents D1
former ones are supported in most commercial database          and D2 by measuring how difficult is to transform D1
systems [2]. On the other hand, the set of techniques of       into D2 (e.g. [19]) and techniques which specify a simple
the latter type is surprisingly small. To our knowledge        and reasonable representation of D1 and D2 that enables
there are just two main representatives of the approach        their efficient comparison and similarity evaluation (e.g.
– so-called Mapping Definition Framework (MDF) [3]             [25]). In case of similarity of document D and schema S
and XCacheDB System [5]. Both support inlining and             there are also two types of strategies – techniques which
outlining of an element / attribute, mapping an element        measure the number of elements which appear in D but
/ attribute to a CLOB column, renaming target tables /         not in S and vice versa (e.g. [6]) and techniques which
columns, and redefining column data types. The former          measure the closest distance between D and all docu-
approach furthermore supports the Edge mapping [11]            ments valid against S (e.g. [18]). Finally, methods for
strategy and enables to specify the required capturing of      measuring similarity of two XML schemes S1 and S2 ex-
the structure of the whole schema. The latter one, in ad-      ploit and combine various supplemental information and
dition, allows a certain degree of redundancy.                 measures such as, e.g., predefined similarity rules, sim-
    In both the cases the mapping for not annotated parts      ilarity of element / attribute names, equivalence of data
is fixed and the annotations are applied just directly on      types and structure, schema instances, thesauri, previous
the annotated schema fragments. The two ideas we want          results, etc. (e.g. [13] [10] [22])
to use for their enhancing are adaptivity [15] and similar-
ity [16].                                                      4 Proposed Algorithm
3.1    Adaptive XML-to-Relational Mapping                      The general idea of fixed schema-driven XML-to-
                                                               relational mapping methods is to decompose the given
Probably the first proposal of an adaptive cost-driven         XML schema S into a set of schema fragments
method can be found in [12]. It is based on the idea           F = {f1 , f2 , ..., fn }. Each fi ∈ F is then mapped to
of storing well structured parts of XML documents              a corresponding relation ri using a mapping strategy
into relations (using the 4NF decomposition) and semi-         sf rag . The relationship between fragments is kept us-
structured parts using an XML data type, which supports        ing a mapping strategy srel . In other words, the 3-tuple
path queries and XML-aware full-text operations. The
main concern of the method is to identify the structured         2 An element with multiple parent elements in the schema – see [21].
τ = < F, sf rag , srel > specifies a particular XML-to-          4.1   Searching for Similar Fragments
relational mapping method. An extreme case is when
                                                                 As we have mentioned, there are numerous approaches to
S is considered as a single fragment which results in a
                                                                 measuring similarity of XML data. Nevertheless, most
single relation, usually with many null values. Other ex-
                                                                 of them cannot be directly used for our case since our
treme occurs when each element in S is considered as a
                                                                 demanded key characteristics differ. In particular, we
single fragment resulting in a huge number of relations
                                                                 search for similarity within the scope of a single schema,
and thus numerous join operations.
                                                                 the similarity measure should not depend on similarity
    Note that fixed generic methods can be described us-
                                                                 of element / attribute names but primarily on complex-
ing the 3-tuple τ as well. Each of the techniques views
                                                                 ity of content models, and the similarity measure cannot
an XML document as general directed tree with several
                                                                 obviously depend on the context of fragments.
types of nodes. And the view can be considered as a spe-
                                                                     Considering the problem in more depth, several fun-
cial kind of XML schema.
                                                                 damental questions arise:
    In user-driven strategies the three characteristics are
influenced by user-defined annotations which specify               1. How are the annotated fragments defined?
how a particular user wants to store selected fragments
F 0 = {f10 , f20 , ..., fm
                         0
                           }. The user usually provides S          2. What types of annotations, i.e.            fixed mapping
with annotating attributes from the predefined set of at-             strategies, are supported?
tribute names ΣA0 , which represent various fixed map-
ping strategies, resulting in an annotated schema S 0 .            3. What measure is used for measuring similarity of
Obviously, annotated fragments in F 0 do not have to                  two schema fragments?
correspond to schema fragments in F . For instance, a              4. Can we optimize the exhaustive search strategy?
user may specify that whole S should be stored using a
4NF decomposition and thus F 0 = {S 0 } is a singleton,             The answers for the questions mutually influence each
whereas corresponding F can contain several schema               other and specify the algorithm. Furthermore, the defini-
fragments depending on the complexity of S. Also note            tion of annotated fragments together with the question of
that while F should cover whole S, F 0 usually does not.         their mutual intersection are closely related to supported
    A classical user-driven strategy consists of the follow-     mapping strategies.
ing steps:
                                                                 4.1.1 Annotated Fragments
 1. S is provided with annotations from ΣA0 resulting
    in S 0 and F 0 .                                             First of all, for easier processing we define a graph rep-
                                                                 resentation of an XML schema S, no matter if annotated
 2. Annotated fragments from F 0 are decomposed                  or not. For easier explanation we assume that the given
    into corresponding schema fragments F1 =                     XML schema S is expressed in DTD3 [9], nevertheless
    {f1 , f2 , ..., fk }.                                        the algorithm can be applied to schemes expressed using,
                                                                 e.g., XML Schema [23] [7] language as well.
 3. Not annotated fragments of S 0 are decomposed into
    schema fragments F2 = {fk+1 , fk+2 , ..., fn } using         Definition 1 A schema graph of an XML schema S is
    a default (predefined or user-defined) fixed strategy        a directed, labelled graph GS = (V, E, ΣE , ΣA , lab),
    sdef .                                                       where
 4. F = F1 ∪ F2 is mapped to R using sf rag and srel .             • V is a finite set of nodes,

   Using this notation our proposed enhancing simply               • E ⊆ V × V is a set of edges,
adds the following steps between the step 1 and 2:                 • ΣE is a set of element names in S,
           0      0
a. For ∀ f ∈ F        we identify a set Ff0 0 of all fragments     • ΣA is a set of attribute names in S, and
   similar to f 0 occurring in S \{f }.
                                  0    0
                                                                   • lab : V → ΣE ∪ ΣA ∪ {“|”, “*”, “+”, “?”, “,”} ∪
b. For ∀ f 0 ∈ F 0 all fragments in Ff0 0 are annotated with         {pcdata} is a surjective function which assigns a
   annotating attributes of f 0 and F 0 = F 0 ∪ Ff0 0 .              label to ∀v ∈ V .

c. S 0 \F 0 is annotated using an adaptive strategy which        Definition 2 A fragment of a schema S is each subgraph
   adds new annotated fragments to F 0 .                         of S consisting of an element e, all its descendants, and
                                                                 corresponding edges.
   The mapping process is schematically depicted in Fig-            Φ is a set of all fragments of S.
ure 1 where the given schema S with two annotated frag-
ments f and g is mapped to a database schema R. If the              Next, we assume that each annotated fragment f 0 ∈
                                                                   0
proposed enhancing (i.e. steps 1.a – 1.c) is included, the       F is uniquely determined by the element e which was
system gradually identifies and adds new annotated frag-         annotated using an annotating attribute a ∈ ΣA0 .
ments f1 , f2 , g1 , g2 , and g3 which are mapped using
user-required mapping strategies. If the enhancing is not        Definition 3 An annotated element e0 of schema S 0 is an
included (i.e. in case of a classical user-driven strategy),     element provided with an annotated attribute from ΣA0 .
only fragments f and g are annotated using user-required             3 We omit supplemental constructs such as entities, CDATA sec-
strategies and the rest of the schema using sdef .               tions, comments, etc.
                                               Figure 1: Schema of the mapping process

Definition 4 An annotated fragment f 0 of schema S 0 is a                of repetitions of a recursive element is small, on average
fragment of S 0 rooted at an annotated element e0 exclud-                less than 5. The two findings enable to treat recursive el-
ing all annotating attributes from ΣA0 .                                 ements not as elements with theoretically infinite depth,
                                                                         but in a much simpler way. We discuss the details later
   As we want to support shared elements and recursion,                  in the text.
since both the constructs are widely used in real XML                        In the following text we assume that a schema graph
data [17], we must naturally allow the annotated frag-                   of an XML schema is always an expanded schema graph,
ments to intersect almost arbitrarily. To simplify the sit-              if not explicitly stated alternatively.
uation, we define an expanded schema graph, which ex-
ploits the idea that both the constructs purely indicate                 4.1.2 Types of Annotations
repeated occurrence of a particular pattern.
                                                                         From Definitions 4 and 5 we can easily prove the follow-
                                                                         ing two statements:
Definition 5 An expanded schema graph Gex  S is a result
of the following transformations of schema graph GS :                    Lemma 1 Each expanded schema graph Gex
                                                                                                             S is a tree.

 1. Each shared element is duplicated for each sharer                    Lemma 2 Two annotated fragments fx0 and fy0 of an ex-
    using a deep copy operation, i.e. including all its                  panded schema graph Gex                         0    0
                                                                                              S 0 can intersect only if fx ⊆ fy
    descendants and corresponding edges.                                     0    0
                                                                         or fx ⊆ fy .
 2. Each recursive element is duplicated for each re-                       Furthermore, we can observe that the common schema
    peated occurrence using a shallow copy operation,                    fragment, i.e. the intersection, contains all descendants
    i.e. only the element node itself is duplicated.                     of a particular element.
                                                                            We distinguish three types of the annotation intersec-
   An illustrative example of a schema graph GS and its                  tion depending on the way the corresponding mapping
expanded schema graph Gex   S is depicted in Figure 2. A                 strategies influence each other.
shared element is highlighted using a dotted rectangle, a
recursive element is highlighted using a dotted circle.                  Definition 6 Intersecting annotations are redundant if
                                                                         the corresponding mapping strategies are applied on the
                                                                         common schema fragment separately.

                                                                         Definition 7 Intersecting annotations are overriding if
                                                                         only one of the corresponding mapping strategies is ap-
                                                                         plied on the common schema fragment.

                                                                         Definition 8 Intersecting annotations are influencing if
Figure 2: A schema graph GS and an expanded schema                       the corresponding mapping strategies are combined re-
graph Gex                                                                sulting in one composite storage strategy applied on the
        S
                                                                         common schema fragment.
   As it is obvious, in case of shared elements the expan-
sion is lossless operation. It simply omits the key advan-                   Redundant annotations can be exploited, e.g., when
tage of shared elements which allows reusing of previ-                   a user wants to store XHTML fragments both in a sin-
ously defined schema fragments. The situation is more                    gle CLOB column (for fast retrieval of the whole frag-
complicated in case of recursive elements which need to                  ment) and, at the same time, into a set of tables (to enable
be treated in a special way henceforth. For this purpose                 querying particular items). An example of overriding an-
we exploit results of statistical analysis of real-world re-             notations can occur when a user specifies a general map-
cursive elements [17], particularly the fact that the most               ping strategy for the whole schema S and then annotates
common type of recursion is linear4 and that the number                  fragments which should be stored differently. Naturally,
                                                                         in this case the strategy which is applied on the common
  4 Consisting of a single recursive element that does not branch out.   schema fragment is always the one specified for its root
element. The last mentioned type of annotations can be               Algorithm 1 Single Annotation Strategy (SAS)
used in a situation when a user specifies, e.g., the 4NF             Input: S 0 , F 0 , sim(fx , fy ), Tsim
decomposition for a particular schema fragment and at                Output: F 00 , i.e. F 0 ∪ newly annotated fragments
the same time an additional numbering schema which                       {construction of the similarity matrix}
speeds up processing of particular types of queries. In               1: F 00 ← F 0
this case the numbering schema is regarded as a supple-               2: for all f 0 ∈ F 0 do
mental index over the data stored in relations of 4NF de-             3:    for all element e ∈ S 0 do
composition, i.e. the data are not stored redundantly as              4:       fe ← subgraph rooted at e
in the first case.                                                    5:       if e ∈ S 0 \{f 0 } then
    We assume that each pair of annotations implemented               6:          sim[f 0 , fe ] ← sim(f 0 , fe )
in a corresponding system is assigned its intersection                7:       else
type or, if necessary, a particular combination of anno-              8:          sim[f 0 , fe ] ← 0
tations can be denoted as forbidden. The existing sys-                9:       end if
tems [3] [5] mostly support overriding annotations, the              10:    end for
XCacheDB system [5], in addition, supports a type of                 11: end for
redundant intersection similar to the above described ex-                {annotation strategy}
ample. Furthermore, we assume that the implemented                   12: for all element e ∈ S 0 do
annotations are assigned priorities which specify the or-            13:    maxe ← maxf 0 ∈F 0 {sim[f 0 , fe ]}
der in which they are composed when applied on com-                  14:    fmax ← f 0 ∈ F 0 s.t. sim[f 0 , fe ] = maxe
mon schema fragment.                                                 15:    if maxe > Tsim then
                                                                     16:       fe .annotation ← fmax .annotation
4.1.3   Search Algorithm                                             17:       F 00 ← F 00 ∪ {fe }
                                                                     18:    end if
The similarity measure, the search algorithm, and its pos-           19: end for
sible optimization are closely related. However, the main            20: return F 00
idea of the enhancing of user-driven techniques remains
the same regardless the chosen measure and algorithm.                    The question is whether this approach is correct actu-
The choice of the measure influences the precision and               ally. From another point of view it could be more rea-
scalability of the system, whereas the algorithm influ-              sonable to annotate an element e using annotations of
ences the efficiency of finding the required fragments.              all fragments f 0 ∈ F 0 s.t. sim(f 0 , fe ) > Tsim . To-
In case of “classical” adaptive methods this can be a                gether with the assumption that for each pair of anno-
marginal aim, since the schema adaptation is performed               tations the result of their composition is predefined and
only once, before the schema is created. But its impor-              that the annotations have priorities according to which
tance significantly rises when the system needs to be dy-            they are composed, this approach seems to be a better
namic and adapt continuously.                                        choice since it does not omit important information. But,
    Let us suppose that we have a similarity measure                 on the other hand, let us consider the situation depicted
sim(fx , fy ) ∈ [0, 1] expressing similarity of two frag-            in Figure 3, where for i ∈ {1, 2, 3} sim(f 0 , fi ) > Tsim
ments fx and fy of an expanded graph, where 1 rep-                   and f 0 is the annotated fragment.
resents strong similarity and 0 strong dissimilarity, and
a similarity threshold Tsim ∈ [0, 1]. A naive strategy
would exploit an exhaustive search as depicted by Algo-
rithm 1.
    The algorithm evaluates similarity of each annotated
fragment f 0 ∈ F 0 and each fragment fe rooted at an el-
ement e ∈ S 0 \{f 0 }. We can assume that if the storage
strategy for any fe should not change, the user would
mark it as final. For such fragment either the default                  Figure 3: Similar fragments on the same root path
strategy sdef or the strategy specified by corresponding
                                                                        The problem is whether we can annotate all the three
user-defined annotation will be used, regardless the re-
                                                                     fragments f1 , f2 , f3 using the annotation of f 0 , espe-
sults of the search or adaptive algorithm. As such this
                                                                     cially what will be the result of intersection in case of
situation is rather the problem of implementation than a
                                                                     f1 and f3 or f2 and f3 , i.e. fragments occurring on
theoretical one and thus we further assume that there are
                                                                     the same root path5 . We can naturally assume that in-
no such fragments in S 0 . On the other hand, we natu-
                                                                     tersection of two identical annotations is overriding and
rally regard fragments annotated by a user to be final by
                                                                     as such has no effect. Thus we could annotate only the
default.
                                                                     topmost fragment on each root path. In case of example
    The resulting similarity values are stored into so-              in Figure 3 this rule would be applied twice, resulting
called similarity matrix {sim[f 0 , fe ]}f 0 ∈F 0 , e∈S 0 . An el-   in a single annotation of fragment f3 . But what if we
ement e is annotated if there exists a fragment fmax ∈ F 0           knew, in addition, that sim(f 0 , f1 ) > sim(f 0 , f3 ) and
with the highest similarity value sim(fmax , fe ) > Tsim .           sim(f 0 , f2 ) > sim(f 0 , f3 )? As it is obvious, in such
In Algorithm 1 we assume that there is always one such               case it is seems to be more reasonable and natural to an-
candidate at most. Otherwise, the system can ask for                 notate fragments f1 and f2 rather than whole f3 . Or this
user intervention when necessary. We call this approach
a single annotation strategy (SAS).                                    5 A path from the root node to a leaf node.
situation can be again a case for user intervention, de-             If we do not know any features of the measure, there
pending on the point of view of it. We will further con-         are not many ways how to avoid the exhaustive search.
sider the former one.                                            Also the order in which fragments in F 0 are processed
    If we generalize the idea, the algorithm annotates an        is then unimportant. But although we can assume that
element e using annotations of all fragments f 0 ∈ F 0           card(F 0 ) = m is small, i.e. that a user annotates sev-
s.t. sim(f 0 , fe ) > Tsim and 6 ∃ element e0 on any root        eral fragments but the number is not large, the exhaus-
path traversing e s.t. sim(f 0 , fe0 ) > sim(f 0 , fe ). The     tive search can be expensive due to the size of GexS . And
resulting algorithm, so-called multiple annotation strat-        even from the simple example in Figure 4 it is obvious
egy (MAS), is depicted by Algorithm 2, where e.ancs              that there are pairs of schema fragments which do not
denotes a set of (direct or undirect) ancestors of element       have to be compared at all. Another problem is the com-
e and e.descs denotes a set of (direct or undirect) descen-      plexity of the if condition of Algorithm 2 (line 5) which
dants of e. The process of construction of the similarity        can in the worst case lead to multiple searching through
matrix remains the same as in case of Algorithm 1.               the whole Gex S . So in both the cases we need to avoid the
                                                                 unnecessary similarity evaluations.
Algorithm 2 Multiple Annotation Strategy (MAS)                       It seems promising to borrow the idea of cluster-
Input: S 0 , F 0 , sim(fx , fy ), Tsim                           ing, similarly to paper [22], where the distance between
Output: F 00 , i.e. F 0 ∪ newly annotated fragments              schema fragments is determined by their mutual similar-
 1: F 00 ← F 0                                                   ity, e.g. dist(fx , fy ) = 1 − sim(fx , fy ). An example is
    {construction of the similarity matrix}                      depicted in Figure 5 for the sample schema in Figure 4.
 2: -//-
    {annotation strategy}
 3: for all f 0 ∈ F 0 do
 4:    for all element e ∈ S 0 do
 5:      if (sim[f 0 , fe ] > Tsim ) ∧
         (6 ∃ea ∈ e.ancs : sim[f 0 , fea ] > sim[f 0 , fe ]) ∧
         (6 ∃ed ∈ e.descs : sim[f 0 , fed ] > sim[f 0 , fe ])
          then
 6:          fe .annotation ← f 0 .annotation
 7:          F 00 ← F 00 ∪ {fe }
 8:      end if                                                             Figure 5: Exploitation of clustering
 9:    end for
10: end for                                                          All schema fragments (depicted using black-filled cir-
11: return F 00                                                  cles) are divided into clusters C1 , C2 ,..., Ck (depicted us-
                                                                 ing black circular lines) having their centroids c1 , c2 ,...,
   Using this approach we should consider what will              ck and radii r1 , r2 ,..., rk (or one common radius r, de-
happen in case a user annotates two structurally identi-         pending on the implementation). Having this informa-
cal (or too similar) fragments using different annotations.      tion, only those schema fragments have to be compared
We cannot simply rely on predefined type of their inter-         with fragment f , whose clusters intersect the cluster with
section and corresponding priorities, because the situa-         centroid f and radius Tsim . In case of Figure 5 these are
tion is a slightly different one. In this case the system        clusters C1 and C2 . Obviously, if the clusters were se-
should rather ask for user intervention whenever it is not       lected appropriately, the amount of comparisons would
able to decide. And this is again a problem of the partic-       decrease rapidly. Hence the key concern of all clustering
ular implementation.                                             algorithms is mainly the construction of the clusters.
                                                                     The construction is usually performed using a k-
4.1.4   Similarity Measure and Optimization of the               means algorithm or its variations (e.g. [22]), where the
        Search Algorithm                                         initial clusters are selected randomly and then iteratively
Now let us consider the search strategy from the point           improved. In the i-th iteration each fragment is com-
of view of complexity of the algorithm. Figure 4 depicts         pared with centroids of all clusters and assigned to the
an example of processing a single annotated fragment, in         closest one. The algorithm terminates if none of the clus-
particular the amount of similarity comparisons. Anno-           ters changes, otherwise new centroids are computed and
tated fragments f and g are highlighted using rectangles,        (i + 1)-th iteration follows. The complexity of the con-
all schema fragments, which are compared with f are              struction is O(I · |Φ| · k), where I is the number of it-
highlighted using dotted ovals.                                  erations. In case of complexity of similarity evaluation
                                                                 the worst case is when either k = 1 or all k clusters
                                                                 mutually intersect, i.e. when we cannot avoid any of
                                                                 the similarity comparisons. Hence in the worst case the
                                                                 number of comparisons is the same as in the exhaustive
                                                                 search strategy and the complexity can worsen only the
                                                                 pre-processing, i.e. the construction of clusters. And this
                                                                 is the step we want to remove too.
                                                                     For further optimization we can exploit characteris-
                                                                 tics of the chosen similarity measure. The existing algo-
           Figure 4: Exhaustive search strategy                  rithms for measuring similarity on schema level usually
exploit various supplemental matchers [20], i.e. func-         we suppose that these are m1 , m2 , ..., mq . For instance a
tions which evaluate similarity of a particular feature of     trivial matcher with such behavior can compare the num-
the given schema fragments, such as, e.g., similarity of       ber of distinct element or attribute names, the amount of
number and types of leaf nodes, similarity of root ele-        similar operators, the depth of the corresponding regu-
ment names, similarity of context, etc.                        lar expression, etc. The heuristic is then based on the
                                                               idea that if at least “sufficient amount” of the q matchers
Definition 9 A matcher is a function m : Φ2 → [0, 1]           exceed their extreme value, we can terminate processing
which evaluates similarity of a particular feature of two      of the current root path too. There are just two differ-
schema fragments fx , fy ∈ Φ.                                  ences from the previously mentioned idea. Firstly, the
                                                               exceeding of the particular extreme is expressed using
Definition 10 A partial similarity measure is a function       a threshold Tex ∈ [0, 1] which guarantees that process-
mpart : Φ2 → [0, 1]p which evaluates similarity of             ing of the current root path does not terminate neither
the given schema fragments fx , fy ∈ Φ using matchers          too soon (to reach the optimum of the composite similar-
m1 , m2 , ..., mp : Φ2 → [0, 1] and returns a p-tuple of       ity measure) nor too late (to avoid unnecessary similar-
their results.                                                 ity evaluations). Secondly, as it is obvious, the matchers
                                                               themselves are not precise in terms of similarity evalu-
   Then the partial results are combined using an appro-       ation. Hence each of them is assigned a user-specified
priate approach, usually a kind of a weighted sum, into        reliability r1 , r2 , ..., rq ∈ [0, 1], where 0 expresses strong
the resulting composite similarity value.                      unreliability and 1 strong reliability of the matcher. The
                                                               reliabilities then influence the real value of threshold Tex
Definition 11 A composite similarity measure is a func-        for particular matchers multiplying its value.
tion mcomp : [0, 1]p → [0, 1] which combines the results
                                                                   The whole optimization of the approach, so-called
of particular matchers and returns the total similarity
                                                               basic annotation strategy (BAS), is depicted by Algo-
value.
                                                               rithm 3, where function terminate returns true if the
    Due to the features of selected partial matchers the       search algorithm should terminate in the given node, oth-
existing techniques usually exploit a bottom-up strategy,      erwise it returns false. Furthermore, we assume that each
i.e. starting from leaf nodes towards the root node. To-       element of the graph is assigned an auxiliary list of can-
gether with the previously mentioned problem of simi-          didates consisting of pairs < f ragment, similarity >,
lar intersecting fragments this is why we need to know         i.e. references to fragments (and corresponding similar-
the behavior of the similarity measure on particular root      ity values) within its subtree that are candidates for an-
paths.                                                         notation.
    For instance, if we knew that the similarity measure           The algorithm processes schema graph starting from
is concave, i.e. that it has only one global maximum,          leaf nodes. For each root path the optimal similarity
we could skip processing of all the ancestors on the           value and the reference to corresponding fragment are
current root path whenever we reach the fragment with          propagated until a better candidate is found or the con-
the extreme value. A sample situation can be seen in           dition of the heuristic is fulfilled. Then the processing
Figure 6 which depicts an example of a graph of sim-           of the current root path is terminated and current can-
ilarity function for an annotated fragment f 0 and frag-       didates are annotated. The complexity of the algorithm
ments f1 , f2 , ..., fr on a single root path. From the        depends on the heuristics. In the worst case it does not
graph we can see, that only fragments f1 , f2 , f3 , f4 need   enable to skip processing of any node that results in the
to be processed (f4 for testing the extremity), then the       exhaustive search. But in contrast to the clustering ap-
similarity evaluation can terminate, skipping fragments        proach we have avoided the expensive preprocessing of
f5 , f6 , ..., fr .                                            the algorithm.
    As it is obvious, this way we can decrease the num-            In general we could use an arbitrary similarity mea-
ber of unnecessary similarity evaluations as well as avoid     sure, not exactly the above defined composite one. It
pre-processing of the schema and expensive checking of         is also possible to use disjoint sets of matchers for the
the if condition of Algorithm 2. Naturally, the efficiency     heuristic and for the composite similarity measure. Nev-
of such approach depends strongly on the position of the       ertheless, we will deal with the above described ones,
extreme on the root path. The key problem is how to de-        since it is the typical and verified way for evaluating sim-
fine such similarity measure. For our purpose we need          ilarity among XML schemes.
a measure which focuses especially on the structure of
the compared fragments, known equivalences or rela-            4.1.5 Recursive Elements
tions between regular expressions, differences between
simple and complex types, etc., i.e. on features that in-      Last but not least, we have to solve the open problem of
fluence the efficiency of database processing the most.        expanded recursive elements, since the expansion is not
But it is hard, if not impossible, to propose a measure        a lossless operation as in case of shared elements. As
with concave behavior which is at the same time enough         we have already indicated, we will exploit the results of
precise in relation to these requests. Nevertheless, we        analysis of real-world XML data which shows two im-
can exploit a relaxed version of this idea as a kind of        portant aspects [17]:
heuristic of the bottom-up strategy.
    Although we can hardly ensure that mcomp is con-             1. Despite it is generally believed that recursive ele-
cave, we can assume that at least q of the matchers, where          ments are of marginal importance, they are used in
1 ≤ q ≤ p, have this property. Without loss of generality           a significant portion of real XML data.
                                Figure 6: Exploitation of behavior of similarity function

 2. Although the recursive elements can have arbitrarily          Under a closer investigation we can see that the user-
    complex structure, the most common type of recur-          given annotations provide a similar information – they
    sion is linear and the average depth of recursion is       say how particular schema fragments should be stored
    low.                                                       to enable efficient data querying and processing. Thus
                                                               we can simply reuse the user-given information. For this
    If we realize that we need the “lost” information about    purpose we define an operation contraction which en-
recursion only at one stage of the algorithm, the solution     ables to omit those schema fragments, where we already
is quite obvious. We analyze the structure of schema           know the storage strategy, and focus on the remaining
fragments when evaluating matchers m1 , m2 , ..., mp ,         ones.
whereas each of the matchers describes similarity of a
particular feature of the given fragments. In case the         Definition 12 A contraction of an expanded schema
fragments contain recursive elements we will not use the       graph Gex                                    0
                                                                        S 0 with annotated fragment set F is an opera-
                                                                                                     0     0
exact measure, but its approximation with regard to the        tion which replaces each fragment f ∈ F with a single
real complexity of recursive elements. For instance if         auxiliary node called a contracted node.
the matcher analyzes the maximum depth of fragment                The resulting graph is called a contracted graph Gcon
                                                                                                                    S0 .
containing a recursive element, the resulting depth is not
infinite, but considers the average depth of real-world re-       From Definitions 4 and 12 we can easily prove the
cursive elements.                                              following simple statement which enables to reuse the
                                                               search algorithm on the contracted graph.
    The question is whether it is necessary to involve
a matcher which analyzes the amount of recursive ele-          Lemma 3 Contracted graph Gcon S 0 of a connected ex-
ments in schema fragments. On one hand it can increase         panded schema graph Gex
                                                                                    S 0 is connected.
the precision of the composite measure. But from an-
other point of view the approximation transforms the re-          The basic idea of the adaptive strategy is as follows:
cursive element to a “classical” element and hence such        Having a contracted graph GconS 0 we repeat the search al-
matcher can be misleading.                                     gorithm (in particular its slight modification) and oper-
                                                               ation contraction until there can be found any fragment
4.2   Adaptive Mapping Strategy                                to annotate. Contrary to the previous situation the search
                                                               algorithm has the following differences:
At this stage of the algorithm we have a schema S 0 and
a set of annotated fragments F 0 which involve the user-         • It searches for schema fragments which are not in-
defined fragments and fragments identified by Algorithm            volved in the schema, i.e. it searches among all
3. As the second enhancing we want to apply an adaptive            nodes of the given (contracted) graph and returns
mapping strategy to the remaining parts of the schema.             the (eventually empty) set of found fragments.
    As mentioned previously, the key idea of adaptive
strategies is to adapt the target relational schema R to         • For similarity evaluation we do not take into ac-
the expected future application, which is specified by             count contracted nodes, i.e. neither their current,
a sample set of XML data and/or XML queries. The                   nor their previous structure or any other features.
techniques define a set of XML-to-XML transformations              These were already analyzed and processed in pre-
which produce a space of equivalent or more general                vious steps of the algorithm.
XML schemes. A search algorithm is used to find a
                                                                 • The annotations of contracted nodes are always
schema, where the evaluation of the given queries is most
                                                                   overriding in relation to the newly defined ones.
efficient. Thus at first glance the user-driven techniques
have nothing in common with the adaptive ones. Or, we            • The required threshold is more precise, i.e. we use
could expect that the user provides not only a set of an-          a threshold Tcon < Tsim .
notations, but also sample XML documents and XML
queries. Then we could use a classical adaptive strategy          We denote this modification of BAS as a contraction-
for not annotated parts of schema S 0 . Such combina-          aware annotation strategy (CAS). (We omit its formal
tion should not be difficult considering the fact that user-   description for obviousness and the paper length.) The
driven techniques enable to store various schema frag-         resulting annotating strategy, so called global annotation
ments in various ways. Nevertheless, the key shortcom-         strategy (GAS), is depicted by Algorithm 4, where func-
ing of such approach is that the user is expected to pro-      tion contract applies operation contraction on expanded
vide too many information.                                     graph of the given schema and corresponding fragments,
Algorithm 3 Basic Annotation Strategy (BAS)
Input: S 0 , F 0 , m1 , m2 , ..., mq , mq+1 , ..., mp , r1 , r2 , ..., rq , Tex , mcomp , Tsim
Output: F 00 , i.e. F 0 ∪ newly annotated fragments
 1: F 00 ← F 0
 2: for all f 0 ∈ F 0 do
 3:    listToProcess ← leaf elements of Gex              0
                                                  S 0 \{f }
 4:    listOfProcessed ← ∅
 5:    while listToProcess 6= ∅ do
 6:       for all e ∈ listToProcess do
 7:          e.candidates ← ∅
 8:          fe ← subgraph rooted at e
 9:          sime ← mcomp (f 0 , fe )
10:          for all c ∈ e.subelems do
11:             for all < f, sim > ∈ c.candidates do
12:                 if sim > sime then
13:                    e.candidates ← e.candidates ∪ {< f, sim >}
14:                 end if
15:             end for
16:          end for
17:          if e.candidates = ∅ ∧ sime > Tsim then
18:             e.candidates ← e.candidates ∪ {< fe , sime >}
19:          end if
20:          if terminate(f 0 , e, m1 , m2 , ..., mq , r1 , r2 , ..., rq , Tex ) then
21:             for all < f, sim > ∈ e.candidates do
22:                 f .annotation ← f 0 .annotation
23:                 F 00 ← F 00 ∪ {f }
24:             end for
25:          else
26:             if ∀ s ∈ e.siblings : s ∈ listOfProcessed then
27:                 listToProcess ← listToProcess ∪ {e.parent}
28:             end if
29:          end if
30:          listToProcess ← listToProcess \ {e}
31:          listOfProcessed ← listOfProcessed ∪ {e}
32:       end for
33:    end while
34: end for
35: return F 00

Algorithm 4 Global Annotation Strategy (GAS)                            interested in the efficiency of the resulting XML-to-
Input: S 0 , F 0 , m1 , m2 , ..., mq , mq+1 , ..., mp , r1 , r2 ,       relational storage strategy which can be verified only
    ..., rq , Tex , mcomp , Tsim , Tcon                                 through appropriate tests on real XML data. We discuss
Output: F 00 , i.e. F 0 ∪ newly annotated fragments                     it in the following section.
 1: F 00 ← BAS(S 0 , F 0 , m1 , m2 , ..., mp , r1 , r2 , ..., rq ,
    Tex , mcomp , Tsim )                                                5     Open Issues
 2: F tmp ← F 00
                                                                        In the previous section we have described and discussed
 3: while F tmp 6= ∅ do
                                                                        the proposed algorithm on theoretical level. We have
 4:     contract(S 0 , F tmp )
                                                                        mentioned several possible solutions and discussed their
 5:     F tmp ← CAS(S 0 , F 0 , m1 , m2 , ..., mp , r1 , r2 , ...,
                                                                        consequences and disadvantages as well as reasons for
        rq , Tex , mcomp , Tcon )
                                                                        the choices we have made. But despite the detailed de-
 6:     F 00 ← F 00 ∪ F tmp
                                                                        scription there are still several open issues. We distin-
 7: end while
                                                                        guish two main categories:
 8: expand contractions(S 0 , F 00 )
 9: return F 00
                                                                            1. Features of the particular implementation and
and function expand contractions expands all the con-                       2. Behavior of the algorithm on real XML data.
tracted nodes of the given schema to the original ones.
   The resulting complexity of the algorithm depends on                    As for the former case the key implementation deci-
the number of iterations of the cycle (lines 3 – 7). In the             sions are especially:
worst case each iteration results in annotating of a single
element, i.e. the search algorithm repeats (|Φ|−|F 0 |+1)                   • the set of supported schema annotations, types of
times.                                                                        their mutual intersection (or forbiddance), and their
   Considering the whole approach, we are especially                          priorities,
    • the matchers m1 , m2 , ..., mq , mq+1 , ..., mp and             Another improvement can be an exploitation of the se-
      corresponding reliabilities r1 , r2 , ..., rq ,             mantic of element and/or attribute names. The similarity
                                                                  can be searched not only on structural level, but using a
    • the composite similarity measure mcomp , and                kind of thesaurus or appropriate user-given information.
    • the thresholds Tsim , Tex , and Tcon .                      Although our proposal focuses mainly on structural sim-
                                                                  ilarities related to efficiency of database processing, it is
    The key problem lies especially in tuning of reliabili-       at least worth testing whether the semantic of the names
ties and thresholds, since both influence the precision and       carries additional important information useful for this
efficiency of the system strongly. The remaining charac-          purpose too.
teristics are related rather to its usefulness and versatility.       Next interesting task could be also a combination of
There are also marginal questions such as, e.g., whether          our approach with a real cost-driven one, i.e. an exploita-
the system will support final elements or user interven-          tion of both user-given annotations and a sample set of
tion if there are more candidates for a particular situa-         XML data and XML queries together. As we have al-
tion. But these features do not have the key influence on         ready mentioned, the key disadvantage is in the amount
the proposed approach itself.                                     of required input data. But, on the other hand, the com-
    The latter category of open issues is quite unpre-            bination of the two approaches could bring interesting
dictable, despite the existing statistics of real XML data        results or the efficiency of the two approaches can be at
[17]. It is caused mainly by two facts. Firstly, although         least compared.
we know the usual characteristics of the real data, we                And last but not least, the key enhancing lies in dy-
cannot predict especially the behavior of more complex            namic adaptability of the system [15]. This challenging
similarity measures due to the above mentioned tuning             but non-trivial task would solve the remaining disadvan-
of the characteristics. And secondly, we cannot predict           tage of the adaptive methods – the fact that the schema
the behavior of the proposed adaptive strategy, since we          is adapted only once, at the beginning but not in case the
have no information about the structure of contracted             application changes.
graphs of real data. Furthermore, the choice of particu-
lar schema fragments will be strongly related to the type
of the tested data and thus the efficiency of the resulting
storage strategy can vary remarkably.
                                                                  References
    As it is obvious, both the categories are also related         [1] XHTML 1.0 The Extensible HyperText Markup
significantly. And though some particular features can be              Language (Second Edition). W3C Recommen-
estimated or preset according to the existing user-driven              dation, August 2002. http://www.w3.org/TR/
systems and statistical analysis of data, most of them will            xhtml1/.
still require a series of experimental tests.
                                                                   [2] S. Amer-Yahia. Storage Techniques and Mapping
6     Conclusion                                                       Schemas for XML. Technical Report TD-5P4L7B,
                                                                       AT&T Labs-Research, 2003.
The main aim of this paper was to illustrate that since the
idea of database-based XML processing methods is still             [3] S. Amer-Yahia, F. Du, and J. Freire. A Compre-
up-to-date, the techniques should and can be further en-               hensive Solution to the XML-to-Relational Map-
hanced. On this account we have proposed a user-driven                 ping Problem. In WIDM’04: Proceedings of the
mapping algorithm which is able to exploit the user given              6th Annual ACM International Workshop on Web
information, i.e. schema annotations, more deeply and,                 Information and Data Management, pages 31–38,
at the same time, to find the mapping strategy for the                 New York, NY, USA, 2004. ACM Press.
not annotated parts more efficiently – using an adaptive
approach. We have described and discussed the algo-                [4] S. Amer-Yahia and M. Fernandez. Overview of
rithm on theoretical level as well as summed up the cor-               Existing XML Storage Techniques. AT&T Labs-
responding open issues related to particular implementa-               Research, 2001.
tion decisions.
    A natural next step of our work is the implementation          [5] A. Balmin and Y. Papakonstantinou. Storing and
of the proposed algorithm and especially experimental                  Querying XML Data Using Denormalized Rela-
tests of its behavior on real data. For this purpose we                tional Databases. The VLDB Journal, 14(1):30–49,
are enhancing our previous implementation of a fixed                   2005.
XML-to-relational mapping strategy [14], which exploits            [6] E. Bertino, G. Guerrini, and M. Mesiti. A Match-
object-oriented features of XML Schema language to-                    ing Algorithm for Measuring the Structural Simi-
gether with features of object-relational database sys-                larity between an XML Document and a DTD and
tems. For the testing we will exploit information about                its Applications. Inf. Syst., 29(1):23–46, 2004.
categories of real XML data and their typical features
[17] as well as existing related works which could help            [7] P. V. Biron and A. Malhotra.     XML Schema
with tuning the system.                                                Part 2: Datatypes (Second Edition). W3C Rec-
    A possible further enhancing of our approach can be                ommendation, October 2004. www.w3.org/TR/
found in focussing on statistically frequent XML schema                xmlschema-2/.
patterns. They can be used in several ways – e.g. as
a reasonable default setting or as XML patterns with a             [8] P. Bohannon, J. Freire, P. Roy, and J. Simeon.
higher priority.                                                       From XML Schema to Relations: A Cost-based
     Approach to XML Storage. In ICDE ’02: Proceed-        [18] P. K.L. Ng and V. T.Y. Ng. Structural Similarity be-
     ings of the 18th International Conference on Data          tween XML Documents and DTDs. In ICCS’03:
     Engineering, pages 64–75, Washington, DC, USA,             Proceedings of the International Conference on
     2002. IEEE Computer Society.                               Computational Science, pages 412–421. Springer
                                                                Berlin / Heidelberg, 2003.
 [9] T. Bray, J. Paoli, C. M. Sperberg-McQueen,
     E. Maler, and F. Yergeau. Extensible Markup Lan-      [19] A. Nierman and H. V. Jagadish. Evaluating Struc-
     guage (XML) 1.0 (Fourth Edition). W3C Recom-               tural Similarity in XML Documents. In WebDB’02:
     mendation, September 2006. http://www.w3.                  Proceedings of the Fifth International Workshop on
     org/TR/REC-xml/.                                           the Web and Databases, pages 61–66, Madison,
                                                                Wisconsin, USA, 2002.
[10] H. H. Do and E. Rahm. COMA – A System
                                                           [20] E. Rahm and P. A. Bernstein. A Survey of Ap-
     for Flexible Combination of Schema Matching Ap-
                                                                proaches to Automatic Schema Matching. The
     proaches. In VLDB’02: Proceedings of the 28th In-
                                                                VLDB Journal, 10(4):334–350, 2001.
     ternational Conference on Very Large Data Bases,
     pages 610–621, Hong Kong, China, 2002. Morgan         [21] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He,
     Kaufmann Publishers Inc.                                   D. J. DeWitt, and J. F. Naughton. Relational
                                                                Databases for Querying XML Documents: Limita-
[11] D. Florescu and D. Kossmann. Storing and Query-            tions and Opportunities. In VLDB’99: Proceedings
     ing XML Data using an RDMBS. IEEE Data Eng.                of 25th International Conference on Very Large
     Bull., 22(3):27–34, 1999.                                  Data Bases, pages 302–314, San Francisco, CA,
                                                                USA, 1999. Morgan Kaufmann Publishers Inc.
[12] M. Klettke and H. Meyer. XML and Object-
     Relational Database Systems – Enhancing Struc-        [22] M. Smiljanic, M. van Keulen, and W. Jonker. Us-
     tural Mappings Based on Statistics. In Selected pa-        ing Element Clustering to Increase the Efficiency
     pers from the Third International Workshop WebDB           of XML Schema Matching. In ICDEW’06: Pro-
     2000 on The World Wide Web and Databases, pages            ceedings of the 22nd International Conference on
     151–170, London, UK, 2001. Springer-Verlag.                Data Engineering Workshops, pages 45–54, Los
                                                                Alamitos, CA, USA, 2006. IEEE Computer Soci-
[13] J. Madhavan, P. A. Bernstein, and E. Rahm.                 ety Press.
     Generic Schema Matching with Cupid. In VLDB
     ’01: Proceedings of the 27th International Con-       [23] H. S. Thompson, D. Beech, M. Maloney, and
     ference on Very Large Data Bases, pages 49–58,             N. Mendelsohn. XML Schema Part 1: Structures
     San Francisco, CA, USA, 2001. Morgan Kaufmann              (Second Edition). W3C Recommendation, October
     Publishers Inc.                                            2004. www.w3.org/TR/xmlschema-1/.
                                                           [24] W. Xiao-ling, L. Jin-feng, and D. Yi-sheng. An
[14] I. Mlynkova and J. Pokorny.        From XML
                                                                Adaptable and Adjustable Mapping from XML
     Schema to Object-Relational Database – an XML
                                                                Data to Tables in RDB.        In Proceedings of
     Schema-Driven Mapping Algorithm. In ICWI’04:
                                                                the VLDB’02 Workshop EEXTT and CAiSE 2002
     Proceedings of IADIS International Conference
                                                                Workshop DTWeb, pages 117–130, London, UK,
     WWW/Internet, pages 115–122, Madrid, Spain,
                                                                2003. Springer-Verlag.
     2004. IADIS.
                                                           [25] Z. Zhang, R. Li, S. Cao, and Y. Zhu. Similarity
[15] I. Mlynkova and J. Pokorny. Adaptability of Meth-          Metric for XML Documents. In FGWM03: Pro-
     ods for Processing XML Data using Relational               ceedings of Workshop on Knowledge and Experi-
     Databases – the State of the Art and Open Prob-            ence Management, Karlsruhe, Germany, 2003.
     lems. In RCIS’07: Proceedings of the 1st Interna-
     tional Conference on Research Challenges in Infor-    [26] S. Zheng, J. Wen, and H. Lu. Cost-Driven Stor-
     mation Science (to appear), Ouarzazate, Morocco,           age Schema Selection for XML. In DASFAA’03:
     April 2007.                                                Proceedings of the 8th International Conference
                                                                on Database Systems for Advanced Applications,
[16] I. Mlynkova and J. Pokorny. Exploitation of                pages 337–344, Kyoto, Japan, 2003. IEEE Com-
     Similarity and Pattern Matching in XML Tech-               puter Society.
     nologies.    Technical report 2006/13. Charles
     University, Prague, Czech Republic, Novem-
     ber 2006. http://kocour.ms.mff.cuni.cz/
     ~mlynkova/doc/tr2006-13.pdf.

[17] I. Mlynkova, K. Toman, and J. Pokorny. Statis-
     tical Analysis of Real XML Data Collections. In
     COMAD’06: Proceedings of the 13th International
     Conference on Management of Data, pages 20–31,
     New Delhi, India, 2006. Tata McGraw-Hill Pub-
     lishing Company Limited.