<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>T a i l o r - M a d e N a t i v e X M L S t o r a g e S t r u c t u r e s</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karsten Schmidt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theo Härder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AG DBIS, Department of Computer Science University of Kaiserslautern</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>96</fpage>
      <lpage>106</lpage>
      <abstract>
        <p>Automatically choosing suitable native storage structures for XML documents arriving at the XML DBMS is a challenging task for its storage manager. While some of the critical parameters require pre-specification, others can be determined by pre-analysis or sampling of the incoming document or by just making experience-driven “educated guesses”. In this paper, we discuss approaches to achieve an adaptive behavior of the storage manager to provide tailor-made native XML storage structures to the extent possible.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The XML data model allows for a great deal of modeling flexibility with which order,
optional and multi-valued concepts, as well as deeply nested relationships can be captured.
Therefore, the resulting tree structures of natively represented XML documents exhibit
large variations in breadth and depth. Furthermore, XML documents may be data-centric
having relatively little ’content’ stored as short text values or they may be
document-centric where a single content node may contain huge portions of text, e.g., an entire book in
a digital library.</p>
      <p>For these reasons, it is highly advisable to equip the storage manager with the ability to
select reasonable or even optimal storage structures for documents. Furthermore, two
cases have to be considered: a document sent by a client arrives at a time—the so-called block
mode, i.e., the document can be pre-parsed and analyzed before a storage structure is
chosen—or it arrives at the DBMS interface in stream mode where fragments present, due to
their size, have to be allocated in a suitable storage structure on disk before the entire
“streamed” document is available for the DBMS. In the latter case, the storage manager
must decide—based on imprecise structural information—on the storage structure to be
chosen and, at best, can make some educated guesses based on context or sampling
information.</p>
      <p>To identify suitable XML storage structures, we describe the most important concepts and
their critical issues in Section 2. In Section 3, we explore methods for content compression,
before we show in Section 4 how to collect storage parameters in an analysis phase. These
concepts and methods are applied and empirically evaluated on tailor-made storage
structures for well-known sample documents in Section 5, before we wrap up with conclusions.
1 This work has been supported by the Rheinland-Pfalz cluster of excellence “Dependable adaptive systems and
mathematical modeling” (see www.dasmod.de).
2</p>
    </sec>
    <sec id="sec-2">
      <title>Essential Concepts for the Storage Manager</title>
      <p>
        Currently, we upgrade XTC [5], our native XML DBMS (XDBMS for short), step by step
towards enhanced adaptivity. As far as physical document handling is concerned, the
abilities of our adaptive storage manager (ASM) primarily rest on a few important concepts.
First of all, when storing and processing XML documents in a native way, the node
labeling scheme used plays a critical role for many XDBMS tasks. Internal evaluation of
navigational (DOM) or declarative requests (XPath, XQuery) as well as concurrency control
on large tree structures should be effectively supported to achieve efficient
transactionprotected processing or cooperative use of XML documents under a variety of XML
language models. Furthermore, indexed access can also take advantage of a suitable node
labeling technique [8]. When carefully optimized, storage consumption of documents and
related index structures can be limited to an acceptable level. In summary, it is key for the
flexibility and performance of the entire internal system behavior. These observations
were confirmed by a systematic evaluation with many experiments and benchmark runs
[4, 6]. As result of our learning process, we recommend prefix-based labeling schemes.
Because any prefix-based scheme such as OrdPaths [12], DLNs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or DeweyIDs [4] is
appropriate for our document storage, we use SPLIDs (Stable Path Labeling IDentifiers)
as synonym for all of them.
      </p>
      <p>Together with prefix-based labels, a so-called path synopsis can provide substantial
processing and compression support in an XDBMS. In an XML document, the structure part
(element/attribute nodes) typically carries a large amount of redundancy due to the verbose
structural description; this is also true when the element/attribute names are replaced by
VocIDs used from a vocabulary. The often huge degree of path repetitions (e.g., /bib/book/
chapter/author) is not reduced by this standard format of XML document storage.
Therefore, we try to get rid of this redundancy while preserving all operational properties of the
document representation. All paths from the root to the leaves having the same sequence
of element/attribute names form a path class. All path classes of an XML document can be
captured in a typically small main-memory data structure called path synopsis [8]. Besides
keeping statistical data, it is used to detect path patterns, structure characteristics, or to
recommend the creation of indexes. Moreover, a given synopsis can be compared to existing
document synopses to find documents with similar structure.</p>
      <p>An important use, the path synopsis enables the derivation of the entire leaf-to-root path
of a content node. For example, when a value in a content index—whose unique position
in the document is identified by its SPLID—is associated with a reference to its path class,
it is easy to reconstruct the specific instance of the path class it belongs to. By numbering
the path classes in the path synopsis, we achieve an effective path class reference (PCR)
serving as a path class encoding. Even an index reference via a SPLID to a structure node
(attribute/element) allows the reconstruction of the referenced node’s ancestor path, when
a PCR added to the index reference. This usage of the path synopsis indicates its central
role in all structural references and operations [8].
1.3.3.3
1.3.3.3.1
1.3.5
1.3.5.3
1.</p>
      <p>1.3.1.3.1</p>
      <p>1.3.5.3.3 1.3.5.5.3.1
Efficient processing of dynamic XML documents requires arbitrary node insertions
without re-labeling, maintenance of document order, variable-length node representation,
representation of long fields, and indexed access. As sketched in Figure 1, the document index
enables direct access of a node when its SPLID is given. Together with the document
container, the document store represents a B*-tree which takes care of external storage
mapping and dynamic reorganization of the document structure. Combined with SPLID use, it
embodies our basic implementation framework to satisfy the above demands efficiently.
In XTC, this base structure comes with a variety of options [5] concerning use of
vocabularies, materialized or referenced storage of content (in leaf nodes), and, most important,
prefix-compressed SPLIDs. As illustrated in Figure 1, the sequence of SPLIDs in
document order lends itself to prefix compression and, indeed, we received impressive results
in numerous empirical experiments [4].</p>
      <p>A final issue to be considered when storing XML documents is the benefit of saving
storage space by applying suitable compression techniques. Two potential areas where
compression can be successfully used are the XML structure itself and its content. In contrast
to the relational world, where typically column-based compression is used, the storage
representation of XML paths and their uncorrelated sequence of element/attribute names
complicate “simple” path-based compression algorithms such as XMill [7]. Furthermore,
transactional modification applied to XML documents prevents block-based compression
used by PPM algorithms [11]. Note, it does not seem to be helpful to separate content and
structure, only to enable the concatenation of smaller values to larger text blocks and, in
this way, to achieve better compression results. Such an approach would involve a
complete cycle of de- and re-compression when a specific node value is modified. Thus, to
avoid undue limitations and overhead of XML processing, compression of single node
values seems to be an appropriate and challenging choice. In our compression study, we
exclusively focus on single nodes and their data stemming either from text content or
attribute values and use character-based compression algorithms.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Empirical Results for XML Content Compression</title>
      <p>To provide some indicative results for the storage of XML documents in various formats,
we have applied a number of empirical tests in the context of our ASM where we used—
to facilitate comparison—the frequently evaluated set of test documents taken from [10].
For a representative subset, Table 1 assembles essential characteristics of the documents
in the “gross” format, i.e., as they arrive at the DBMS.</p>
      <p>Because content compression is orthogonal to the question how the overall document is
stored, we explore the compression efficiency to be gained irrespective of how it is used
by the ASM. Thus, we restrict ourselves to tests using four encoding methods:
• Fixed Huffman (M1): During a preanalysis run, the character frequencies were collected
on a document basis, for which a Huffman tree was then constructed. To adjust for later
document modifications, encoding was provided for all 256 possible characters.
• Flexible Choice (M2): Depending on the characteristics of a content node, it is either
encoded by a document-wide Fixed Huffman or a tailor-made node-wide Huffman. If
chosen, the tailored Huffman tree (typically &lt;40 nodes) is stored together with the
encoded node’s content.
• Selective Encoding (M3): This method optimizes the runtime of M2 by calculating and
applying tailored Huffmans only to longer text/attribute values; smaller text values
were encoded by the Fixed Huffman of the document.
• Domain Encoding (M4): Especially applicable to small documents, an overall encoding
constructed from a domain-related character distribution base is used to reduce space
requirements and to speed up compression time.</p>
      <p>M1 uniformly encodes all content nodes of the document, even if compression leads to
larger encoded equivalents, for example in case of short values. To have more flexible
encoding options by selecting among different algorithms, one or more decision bits are
needed for each record to indicate its compression state. However, the savings for multiple
encodings may be compensated by such decision bits as shown by our experiments,
because often byte alignment consumed additional space. Moreover, each compressed bit
string had to be aligned to the next byte boundary to observe the record’s byte format. With
this increased flexibility, M2 computes a tailored Huffman for every content where
applicable. However, this leads to increased computation time and one decision bit for every
record to distinguish the type of compression (fixed or tailored). For this reason, M3 tries
to reduce these extensive computation times by a threshold heuristically chosen to decide
whether or not a tailored Huffman has to be derived. As expected, this leads to shorter
compression times thereby retaining acceptable compression ratios. Method M4 applies
this idea to several (small) documents to avoid the creation of a per document encoding
table which may be larger than the document to be encoded (see Section 5.2).
In the following, all results are compared against Standard format where the document
nodes are stored as variable-length records including SPLIDs, VocIDs, type descriptors,
byte alignment, etc. Our compression experiments only considered content nodes with text
or attribute values. To give some coarse hints about the potential compression gain, we
have captured the avg. value sizes of content (text and attribute) nodes2 in Table 2 where
our experiments are summarized: M1 is fast and delivers considerable compression ratios
varying from ~25% to ~35% on all documents. Note, M1 compression reduces the overall
storage time up to ~15% compared to Standard, because less I/O is needed to store the
corresponding document on disk. M2 and M3 consume substantially more computing time for
the compression (not shown in Table 2). Furthermore, they exhibit no or only tiny
compression improvements on the kind of documents in our reference collection. For this
reason, M1 is our recommended and preferred compression method for large documents.
Looking at the columns ’avg. value length per text/attribute/content node’, most of the
documents in Table 2 can be classified as “data centric” where a cheap M1-type algorithm
seems to be an adequate compression solution. In contrast, so-called “document-centric”
XML structures would embody a greater potential for compression which could be
exploited by the M2/M3-type algorithms, because larger portions of text fill the content
nodes. In our collection of reference documents, only nasa has ’weak’ document-centric
properties. In truly document-centric XML structures, substantially greater compression
gains may be anticipated by applying tailored compression schemes [7, 13].
4</p>
    </sec>
    <sec id="sec-4">
      <title>Collecting Document Parameters</title>
      <p>Whether or not text compression has to be applied, it is typically pre-specified and may
depend on update frequencies as well as availability or cost of resources (storage space,
CPU). If compression is desired, ASM can easily decide on a suitable method which may
2 The effectiveness of compression relies on many factors, not only content size. Although lineitem has only
small content nodes, they primarily contain digits. This ’narrow’ alphabet allows relatively better compression
rates as treebank containing larger enciphered texts nodes resulting in a ’broader’ alphabet.
be selected based on domain-dependent characteristics. For further parameters concerning
physical document representation, some dynamic approximations can be derived.
Therefore, two cases have to be distinguished: block-mode and stream-mode arrival of (large)
XML documents. In any case, an as thorough as possible analysis of the incoming
document seems mandatory to accomplish highly optimized storage representations for XML
documents. To give an impression of the analysis task needed, we refer to Table 1.
4.1</p>
      <sec id="sec-4-1">
        <title>Dynamic Document Analysis</title>
        <p>To enable adaptive decisions, ASM may scans the available fragment of the document, in
case of block-mode arrival often the entire document, and collect significant document
parameters to adjust an initial default parameter setting. Statistical data may
include—besides the degree of document-centric compression behavior—number of nodes (i.e.,
element, attribute, and text nodes), maximum depth and average depth, various fan-out ratios,
number of distinct element names, as well as number of distinct paths per path class.
Furthermore, the size of text nodes helps to adjust some storage parameters and the size of the
document store can be estimated from such statistical data.</p>
        <p>A vocabulary is essential for saving document storage space by encoding the element and
attribute names, e.g., using one-byte or two-byte integers as VocIDs. It can be represented
by a little main-memory data structure (for typically a few hundred names). Therefore,
while scanning the document, its vocabulary is incrementally built. Furthermore, the path
synopsis containing all path classes and PCRs is derived. Both data structures can be
completed on block-mode arrival, whereas a stream-mode document may leave at the end of
the analysis phase fragmentary data structures to be completed in later phases.
Another problem is unused space in container pages which is crucial when text nodes have
to be allocated. They are materialized in container pages up to a parameterized
max-valsize. When the size exceeds max-val-size; the text is stored in referenced mode possibly
divided in parts each stored into a single page and reachable via a reference from its home
page. Hence, maximizing page utilization may be achieved by a document-dependent page
size optimization. Regarding the document store as an index, these findings can be applied
to additional indexes, too.
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Approximating Document Parameters by Sampling</title>
        <p>To check a conceivable reduction of the analysis effort, we ran some sampling experiments
for the documents of Table 1. Sampling only allows to approximate important auxiliary
structures and configuration parameters such as vocabulary, path synopsis, content size per
node, fan-out (in upper levels), and document depth. In a number of sampling experiments,
we have determined—for the parameters applicable—the ranges of the estimation errors
to be expected. In Figure 2a, the graphical symbols depict the average estimation error per
document, whereas the max/min of the error range correspond to estimations computed
when a sampling buffer was filled with 1 resp. 50 MB. These results highlight one of the
most fundamental problems in sampling small pieces of documents with heterogeneous or
a) Ranges of sampling errors on XML documents b) Extrapolated sampling errors when doc siz is known
% s
5 5 1860 sn sn s</p>
        <p>n
n</p>
        <p>G; n 5
dblp nasa</p>
        <p>n
42 GK5; GK;5 GK;5 GK;5 sGK;5n
0 5 10 15 20 25 30 35 40 45 50</p>
        <p>% of total document sampled
voc size n
path classes 5
lineitem G
uniprot ;</p>
        <p>dblp
psd7003
n
5</p>
        <p>nasa s
treebank K
skewed structures. As indicated for dblp and treebank in Figure 2a, parameters such as
vocabulary size and path class may cause the selection of an unfit representation model.
When we start building the document with such “wrong guesses”, we may get suboptimal
structures or may be enforced to revise our design decision. In general, however, the
parameters for max/avg depth, average text size, and fan-out are accurate and stable, even for
tiny fractions of 1 MB samples. Hence, we can use stable parameters for decisions
concerning SPLID encoding and page size tuning. Of course, a Fixed Huffman can be derived
by sampling, too. Because character distribution and their frequencies typically are domain
dependent, nearly optimal encodings can be expected. Stream-mode documents
necessarily enforce ASM to configure storage structures with less than perfect parameter
knowledge, as characterized in Figure 2a. Because file size information is available for
blockmode documents, extrapolation of some parameters using the size of the entire document
is applicable. To show the precision of a sampling step instead of a full scan for size
information (number of attribute/element/text nodes), Figure 2b exhibits the relative estimation
errors for various sample sizes. Surprisingly, our results reveal that, even with only a 1%
sample, an error of not more than ~10% may be expected. Of course, larger sample sizes
improve this error margin. Figure 2b also shows that there exist “simply structured”
documents where sampling delivers perfect knowledge of size parameters even using very
small samples. However, nasa exemplifies what happens when sampling and
extrapolating unbalanced documents, its error is bounded to ~12%. In summary, sampling often
delivers accurate-enough parameters for ASM to plan the physical configuration of an XML
document.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Building Tailor-Made Storage Structures</title>
      <p>So far, we have outlined the essential concepts and the critical parameters for XML
document storage. In our empirical study, we focus on the variability and optimization of
storage structures which can be chosen by the DBMS for incoming documents. The question
which secondary element/attribute or text indexes should be provided is orthogonal to the
choice of the native document structure and has to be answered w.r.t. the expected
workload; how indexes can be built, is discussed in [8]. For the support of index access and
virtualization of the document index, a path synopsis is provided together with the document
container, where applicable. Here, we primarily want to illustrate how much storage
consumption can be reduced by deriving configurations tailored to the parameters of the
documents. An important optimization is the use of tailored SPLIDs, which was applied in all
experiments. Note, in all cases neither set-oriented operations (e.g., XQuery) nor
node-oriented navigational operations (e.g., DOM) are restrained or impeded.
5.1</p>
      <sec id="sec-5-1">
        <title>Storing Single Documents</title>
        <p>Single large documents are assigned to separate storage structures. The obvious choice
denoted as Model 1 is the physical representation of all structure and content nodes in a set
of container pages together with a document index. The corresponding schema shown in
Figure 1 also reveals that the SPLIDs occur in sort order and that they lend themselves to
a very effective prefix compression [4].</p>
        <p>In most cases, the set of structure nodes contains a dramatic degree of replication within
the path instances of a given path class. Therefore, applying SPLIDs for node labeling in
the document, each path instance the node belongs to can be efficiently derived using the
path synopsis. This observation leads to the opportunity to virtualize the document
structure. For this purpose, we design Model 2 as a modification of Model 1 by optimizing the
layout of the container pages in Figure 1 using an “elementless” XML representation while
preserving all document properties. Only the values of leaf nodes together with their
SPLIDs and PCRs, is stored in the container pages. The PCRs are used to derive the inner
document structure on demand [8].</p>
        <p>At first, we focus on the minimization % 1
scouoflntthsteeinnsttFrnuiocgdtuuersreeu3pnaacrrotemonnpolryremsasnaelddiz.leeTdavhwee.trrh.ete-. 7800 leodM led2
the Standard format using plain 60 oM
SPLIDs and “long” VocIDs (2 bytes). 50
Model 1 stores all structure and
content nodes with prefix-compressed 40
SPLIDs and VocIDs (1 byte). Model 2 30
only stores the content nodes carrying
prefix-compressed SPLIDs and ad- 20
justed PCRs (only 1 byte, if applica- 10
sbalev)e.sC~o2m0p%are–d t~o4S0t%andaanrdd,MMooddeell 21 0 unpir-ot trbeaen-k ps7d0-03 liintee-m dblp nasa
reaches ~30% – ~50% when consider- content VocIDs + admin
ing the entire document. If we regard compressed SPLIDs PCRs + admin
the structural part only, the saving in- Figure 3 Storage consumption of XML documents
creases substantially to ~40% – ~61%
and ~69% – ~86% for Model 1 and Model 2, respectively.
The second group of experiments explores the potential of content compression (discussed
in Section 3). Due to space limitations we omit a detailed discussion of the results and
restrict ourselves to a summary. Using the simple compression method M1 on our reference
collection of documents, the relative share of content nodes is reduced from ~28% – ~59%
to ~19% – ~37% as compared to Standard. Concerning the entire document, compressed
Model 1 reduces the storage consumption to less than 65%, whereas compressed Model 2
reaches less than 50% in all cases. Model 2 assumes a reasonably small path synopsis
available for all processing steps which is true for probably more than 90% of the XML
documents [9]. Looking at the parameter for path classes in Table 1, treebank, however,
can be characterized as such an exotic outlier (maximum depth of 37), where only Model
1 should be applied.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Collections of Documents</title>
        <p>To handle homogeneous documents that are somehow inter-related, we introduce
collection assembling documents which embody similarities w.r.t. structure, domain affiliation,
and vocabularies. Model 1 and 2 disclose tremendous drawbacks storing a large number
of small documents in separate container pages. In addition, separately stored auxiliary
structures would consume an enormous number of weakly filled pages. To avoid such low
storage occupancy, small and large documents are physically stored together as subtrees
under a virtual root node in a collection which, in turn, may be indexed by a B*-tree.
Homogeneous documents by structure and/or by content may take advantage of common
structure/content indexes and gain better storage utilization and query support. Our
collection idea is endorsed by an analysis of typical documents and varieties in the recently
proposed TPoX [14] benchmark which serves as an illustration for our concepts. In general,
our collection approach attempts to exploit Model 2 (“elementless”) storage and to process
distorted documents in an appropriate way. For this reason, we apply, when appropriate,
multiple path synopses within a single collection definition, as shown in Figure 4a. Every
document, stored according to Model 2, is assigned to the closest matching path synopsis,
to a new path synopsis, or, when Model 2 is not smoothly applicable, it is stored according
to Model 1, possibly without a path synopsis3. Such a dynamic assignment is sketched in
Figure 4b while referring to the various TPoX structures. This XML benchmark proposal
consists of three different groups of documents: one for Account data, one for Order data
and one for Security attachments of customers. Designed for extreme scalability up to
PBytes, the smallest benchmark configuration already provides more than 80,000
documents having plain sizes between 1 and 18 KBytes. Even without an existing XML
schema, an allocation of new documents would be possible—because of the structural
simplicity of these documents—only by path synopsis matching. In the path synopses in Figure 4,
simple top-down comparison suffices to determine—using an adequate similarity
threshold—the collection membership. For the TPoX documents, the structural information
within each group is confined to a maximum of 139 path classes and to 139 VocIDs to
en3 Note, a path synopsis is mandatory for elementless structures to regain the internal document structure. It may
be helpful to optimize specific operations for documents represented by Model 1, too.
Metadata
PS1</p>
        <p>PS2
code the documents’ node names. Including a number of common structural elements, the
resulting TPoX collection is confined to 295 distinct path classes and a vocabulary of 276
elements. Obviously, such database-driven collections remain quite stable, because
schema evolution may only occur in exceptional cases and, thus, the path synopsis will
preserve its reasonably small size. The mapping of path synopses to documents and vice versa
can be handled by a simple lookup table in the system catalog.</p>
        <p>By combining the physical storage of a collection, it is an adequate design decision to use
indexes for holistic query support over all documents. Typically, a collection forms a
uniform database and queries refer to all of its documents at the same time. Therefore, if a user
would enforce a document with totally incompatible characteristics into an existing
collection, existing path synopsis and index accuracy would be inflated by structural information
alien to the collection’s infrastructure. Thus, our ASM in XTC always tries to determine a
suitable existing collection by path synopsis matching and vocabulary comparison. If all
matches violate a reasonable threshold, an incoming document is considered
heterogeneous to all existing collections. To preserve their beneficial storage and processing
properties, the best decision in such a case is to store it as a singleton according to Model 1 or 2.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper, we primarily discussed important concepts needed to obtain optimal and
tailor-made storage structures for XML documents. Furthermore, we elaborated the potential
benefit of compression methods applied to content nodes. For block-mode and
streammode document arrival, we sketched the kind of analysis needed to identify structure
parameters for optimal and, if possible, automatic selection of a storage model. Our
performance measures indicate the potential storage saving and operational gain using such
concepts of adaptivity.
What does this saving achieved by structure encoding and content compression mean? For
example, assume uniprot: the gross document (in text format) arriving at the DBMS has
1820 MBytes. A straightforward encoding (VocIDs for the element/attribute names,
uncompressed content, added node labels), here denoted as Standard, results in 1685
MBytes. Our optimizations obtain for the compressed Model 1 and Model 2 ~1060 and
~825 MBytes, respectively. Using any of these models, all declarative or navigational
operations can be applied with the same or improved speed. Even when storing compressed
contents, the use of indexes does not provide any problem. Content-and-structure (CAS)
queries are particularly efficient, because our way of processing the queries [8] can often
avoid expensive structural joins or twig evaluation and can derive the path information
from the combined use of SPLIDs and path synopsis. For specific CAS queries supported
by content indexes, up to two orders of magnitude response-time reduction compared to
traditional approaches were achieved with our XTC prototype DBMS [5].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14]</source>
          [15]
          <string-name>
            <surname>Böhme</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Supporting</surname>
          </string-name>
          <article-title>Efficient Streaming and Insertion of XML Data in RDBMS</article-title>
          .
          <source>Proc. 3rd Int. Workshop Data Integration over the Web (DIWeb)</source>
          , Riga, Latvia,
          <fpage>70</fpage>
          -
          <lpage>81</lpage>
          (
          <year>2004</year>
          )
          <article-title>Bruno</article-title>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            , and
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Holistic Twig Joins: Optimal XML Pattern Matching</article-title>
          .
          <source>Proc. SIGMOD</source>
          :
          <fpage>310</fpage>
          -
          <lpage>321</lpage>
          (
          <year>2002</year>
          )
          <article-title>Goldman</article-title>
          ,
          <string-name>
            <given-names>R.</given-names>
            , and
            <surname>Widom</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <article-title>DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases</article-title>
          .
          <source>Proc. VLDB</source>
          :
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
          (
          <year>1997</year>
          ) Härder,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Haustein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Mathis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            , and
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Node Labeling Schemes for Dynamic XML Documents Reconsidered</article-title>
          .
          <source>Data &amp; Knowl. Engineering</source>
          <volume>60</volume>
          :
          <fpage>1</fpage>
          ,
          <fpage>126</fpage>
          -
          <lpage>149</lpage>
          (
          <year>2007</year>
          ) Haustein,
          <string-name>
            <given-names>M. P.</given-names>
            ,
            <surname>Härder</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <article-title>An Efficient Infrastructure for Native Transactional XML Processing</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          <volume>61</volume>
          :
          <fpage>3</fpage>
          ,
          <fpage>500</fpage>
          -
          <lpage>523</lpage>
          , Elsevier, (
          <year>2007</year>
          ) Haustein,
          <string-name>
            <given-names>M. P.</given-names>
            ,
            <surname>Härder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Mathis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            , and
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>DeweyIDs - The Key to FineGrained Management of XML Documents</article-title>
          , in
          <source>: Proc. 20th Brasilian Symposium on Databases, Oct</source>
          .
          <year>2005</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Liefke</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>and</article-title>
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>XMill: an Efficient Compressor for XML Data</article-title>
          .
          <source>Proc. SIGMOD</source>
          :
          <fpage>153</fpage>
          -
          <lpage>164</lpage>
          (
          <year>2000</year>
          ) Mathis,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Härder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            , and
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <article-title>Storing and Indexing XML Documents Upside Down, submitted</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Mignet</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbosa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Veltri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>The XML Web: a First Study</article-title>
          .
          <source>Proc. 12th Int. WWW Conf</source>
          .,
          <string-name>
            <surname>Budapest</surname>
          </string-name>
          (
          <year>2003</year>
          ), www.cs.toronto.edu/~mignet/Publications/www2003.pdf Miklau,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>XML Data Repository, www</article-title>
          .cs.washington.edu/research/xmldatasets Ng,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Lam</surname>
          </string-name>
          , W. Y., and Cheng, J.
          <source>Comparative Analysis of XML Compression Technologies. World Wide Web</source>
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2006</year>
          )
          <string-name>
            <given-names>O</given-names>
            <surname>'Neil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            ,
            <surname>O'Neil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            ,
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Cseri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Schaller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            , and
            <surname>Westbury</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>ORDPATHs: Insert-Friendly XML Node Labels</article-title>
          .
          <source>Proc. SIGMOD</source>
          :
          <fpage>903</fpage>
          -
          <lpage>908</lpage>
          (
          <year>2004</year>
          )
          <article-title>Toman, V. Syntactical Compression of XML Data. caise04dc.idi.ntnu.no/CRC_CaiseDC/toman.pdf XML Database Benchmark: Transaction Processing over XML (TPoX</article-title>
          ), http://tpox.sourceforge.net/ (
          <year>January 2007</year>
          ) Zhang, N.,
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aboulnaga</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I. F.</given-names>
          </string-name>
          <string-name>
            <surname>XSEED</surname>
          </string-name>
          <article-title>: Accurate and Fast Cardinality Estimation for XPath Queries</article-title>
          .
          <source>Proc. ICDE</source>
          <year>2006</year>
          :
          <volume>61</volume>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>