<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Compression of the Stream Array Data Structure</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Radim Bača</string-name>
          <email>radim.baca@vsb.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Pawlas</string-name>
          <email>martin.pawlas@vsb.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Technical University of Ostrava</institution>
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <fpage>23</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In recent years, many approaches to XML twig pattern query (TPQ) processing have been developed. Some of these algorithms are supported by a stream abstract data type (ADT), which is usually implemented using an inverted list or a special-purpose data structure. In this article, we focus on an efficient implementation of the stream ADT. We utilize features of the stream ADT in order to implement a compressed stream array, and we compare it with the regular stream array.</p>
      </abstract>
      <kwd-group>
        <kwd>Stream ADT</kwd>
        <kwd>XML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, many approaches to XML twig pattern query (TPQ) processing have been developed. Indexing techniques for the XML document structure have been studied extensively, and works such as [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref3 ref4 ref5 ref8 ref9">11, 10, 8, 1, 3, 9, 4, 5</xref>
        ] have outlined the basic principles of streaming scheme approaches. Each node of an XML tree is labeled by a labeling scheme [
        <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
        ] and stored in a stream array. Streaming methods usually use the XML node tag as the key of one stream. The labels retrieved for each query node tag are then merged by some type of XML join algorithm, such as the structural join [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or holistic
join [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        We can also use relational databases to store and query a labeled XML tree; however, the join operation of a relational query processor is not designed for this purpose. Due to this fact, XML joins significantly outperform relational query processors [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>XML joins are based on a stream abstract data type, which is usually implemented using an inverted list or a special-purpose data structure. In this article, we focus on an efficient implementation of the stream ADT. We utilize features of the stream ADT in order to implement a compressed stream array and compare it with the regular stream array. We utilize fast Fibonacci encoding and decoding algorithms in order to achieve maximal efficiency of the resulting data structure. Moreover, our compressed stream array data structure allows us to store variable-length labels, such as Dewey order labels, without storage overhead.</p>
      <p>In Section 2, we describe the XML model. Section 3 introduces the stream abstract data type and outlines the persistent stream array and its compression. In Section 4, we describe different compression techniques applied to a block of the stream array. Section 5 describes our experimental results.</p>
    </sec>
    <sec id="sec-xml-model">
      <title>XML model</title>
      <p>An XML document can be modeled as a rooted, ordered, labeled tree, where every node of the tree corresponds to an element or an attribute of the document and edges connect elements, or elements and attributes, having a parent-child relationship. We call such a representation of an XML document an XML tree. We can see an example of an XML tree in Figure 1. We use the term 'node' for a node of an XML tree representing an element or an attribute.</p>
      <p>
        The labeling scheme associates every node in the XML tree with a label. These labels allow us to determine structural relationships between nodes. Figures 1(a) and 1(b) show the XML tree labeled by the containment labeling scheme [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Dewey order [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
respectively.
      </p>
      <p>The containment labeling scheme creates labels according to the document order. We can use a simple counter which is incremented every time we visit a start or end tag of an element. The first and the second number of a node label represent the value of the counter when the start tag and the end tag are visited, respectively. In the case of Dewey order, every number in the label corresponds to one ancestor node.</p>
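      <p>The counter-based construction above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the Node type and its field names are hypothetical:</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical node of an XML tree; 'start' and 'end' together form the
// containment label assigned below.
struct Node {
    std::vector<Node> children;
    int start = 0;
    int end = 0;
};

// Depth-first, document-order traversal: one shared counter is incremented on
// every start tag and every end tag, exactly as described in the text.
void assignContainmentLabels(Node& n, int& counter) {
    n.start = ++counter;                       // start tag visited
    for (Node& child : n.children)
        assignContainmentLabels(child, counter);
    n.end = ++counter;                         // end tag visited
}
```

      <p>With such labels, a node is an ancestor of another node exactly when its (start, end) interval contains the other node's interval.</p>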
      <p>" # #
(a)
!
(b)
Holistic approaches use an abstract data type (ADT) called a stream. A stream is an
ordered set of node labels with the same schema node label. There are many options for creating schema node labels (also known as streaming schemes). A cursor pointing to the first node label is assigned to each stream. We distinguish the following operations of a stream T: head(T) – returns the node label at the cursor's position; eof(T) – returns true iff the cursor is at the end of T; advance(T) – moves the cursor to the next node label. Implementations of the stream ADT usually contain additional operations: openStream(T) – opens the stream T for reading; closeStream(T) – closes the stream.</p>
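      <p>One possible shape of this interface is sketched below, backed by an in-memory vector of labels for simplicity; the class and method names are our own, while the article's actual implementation is the persistent stream array of Section 3:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A node label is a sequence of numbers (containment pair or Dewey path).
using Label = std::vector<int>;

// In-memory sketch of the stream ADT operations named in the text.
class Stream {
public:
    explicit Stream(std::vector<Label> labels) : labels_(std::move(labels)) {}
    // openStream(T): reset the cursor to the first node label.
    void open() { cursor_ = 0; }
    // head(T): return the node label at the cursor's position.
    const Label& head() const { return labels_[cursor_]; }
    // eof(T): true iff the cursor is at the end of the stream.
    bool eof() const { return cursor_ >= labels_.size(); }
    // advance(T): move the cursor to the next node label.
    void advance() { ++cursor_; }
private:
    std::vector<Label> labels_;
    std::size_t cursor_ = 0;
};
```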
      <p>The stream ADT is often implemented by an inverted list. In this article, we describe a simple data structure called the stream array, which implements the stream ADT. We test different compression techniques in order to decrease the number of disk accesses. The stream array also allows us to store variable-length vectors efficiently.</p>
    </sec>
    <sec id="sec-persistent-stream-array">
      <title>Persistent stream array</title>
      <p>
        The persistent stream array is a data structure which uses a common architecture, where data
are stored in blocks on secondary storage and a main memory cache keeps blocks read from the secondary storage. In Figure 2 we can see an overview of this architecture. The cache uses the least recently used (LRU) policy for the replacement of cache blocks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Each block consists of an array of tuples (node labels) and a pointer to the next block in the stream. The pointers enable the dynamic character of the data structure: we can easily insert or remove tuples from the blocks without a time-consuming shift of all items in the data structure. Blocks do not have to be fully utilized; therefore, we also keep the number of tuples stored in each block.</p>
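      <p>The block layout just described can be sketched as follows; the field names and the small capacity are illustrative assumptions, not the authors' exact layout:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A tuple is one node label with a variable number of dimensions.
struct Tuple { std::vector<uint32_t> dims; };

// Sketch of one stream array block: an array of tuples, the number of tuples
// actually stored (blocks need not be fully utilized), and a pointer to the
// next block in the stream.
struct Block {
    static const int kCapacity = 4;   // tiny capacity, for illustration only
    Tuple tuples[kCapacity];
    int count = 0;                    // tuples currently stored
    Block* next = nullptr;            // next block in the stream
};

// Insert without splitting: append into the block if it has room; a full
// block would trigger the split handled by Algorithm 1.
bool insertIntoBlock(Block& b, const Tuple& t) {
    if (b.count >= Block::kCapacity) return false;
    b.tuples[b.count++] = t;
    return true;
}
```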
      <p>
        Insert and delete operations We briefly describe the insert and delete operations of the stream array in order to see how the data structure is created. In Algorithm 1 we can observe how a label is inserted; B.next is a pointer to the next block in the stream. We try to keep the utilization of blocks high by using a split technique similar to that of the B+tree [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where we create three 66% full blocks out of two full blocks when possible.
      </p>
      <p>The delete operation is very similar to insert. We merge blocks in the case that their utilization falls below a threshold. However, this operation is out of the scope of this article.</p>
    </sec>
    <sec id="sec-compressed-stream-array">
      <title>Compressed stream array</title>
      <p>There are two reasons for stream array compression. The first advantage is that we
can decrease the size of the data file and therefore decrease the number of disk accesses. Of course, there is extra time spent on the compression and decompression of the data. The compression and decompression time should be lower than or equal to the time saved by performing fewer disk accesses. As a result, the compression algorithm should be fast and should have a good compression ratio. We describe different compression algorithms in Section 4. (Algorithm 1: Insert the lT label into the stream array.)</p>
      <p>The second advantage is that we can store variable-length tuples. Tuples in a regular stream block are stored in an array with a fixed item size. The item size has to be equal to that of the longest label stored in the stream array, so we waste quite a lot of space in this way. A compressed stream block does not use an array of items in the block, but a byte array in which the tuples are encoded.</p>
      <p>The stream array has a specific feature which enables efficient compression: we never access the items in one block randomly during a stream read. Random access to a tuple in the block may occur only during the stream open operation, but the stream open is not processed very often. Therefore, we can keep the block items encoded in the byte array and remember only the actual cursor position in the byte array. The cursor is created during the stream open and it also contains one tuple, where we store the decoded label of the current cursor position. Each label is decoded only once, during the advance(T) operation. The head(T) operation only returns the decoded tuple assigned to the cursor. Using this schema, we keep the data compressed even in the main memory and need to keep only one decompressed tuple for each opened stream.</p>
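      <p>This read path can be sketched as follows; the length-prefixed byte layout and all names here are our own illustrative assumptions (the article's blocks hold Fibonacci-coded labels), but the cursor discipline is the one described above: one position into the encoded bytes and a single decoded tuple:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Cursor over a block whose tuples stay encoded in a byte array.
// Layout assumption: each tuple is stored as a 1-byte dimension count
// followed by that many value bytes.
struct CompressedCursor {
    const std::vector<uint8_t>* bytes;   // the block's encoded byte array
    std::size_t pos = 0;                 // cursor position in the byte array
    std::vector<uint8_t> tuple;          // the single decoded tuple

    // advance(T): decode the next tuple; each label is decoded only once.
    bool advance() {
        if (pos >= bytes->size()) return false;
        uint8_t len = (*bytes)[pos++];   // dimension count prefix
        tuple.assign(bytes->begin() + pos, bytes->begin() + pos + len);
        pos += len;
        return true;
    }
    // head(T): just return the already-decoded tuple.
    const std::vector<uint8_t>& head() const { return tuple; }
};
```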
    </sec>
    <sec id="sec-block-compression">
      <title>Block Compression</title>
      <p>In the following sections, we describe the compression algorithms implemented for our tests and show examples of these algorithms.</p>
    </sec>
    <sec id="sec-variable-length-tuple">
      <title>Variable length tuple</title>
      <p>This compression is based only on the fact that we can store variable-length tuples. It is achieved by saving the dimension length with each tuple.</p>
      <p>Example 4.1. Let us have these two tuples: ⟨1, 2⟩ and ⟨1, 2, 3, 7⟩. When using this compression, they occupy 6×4 B plus 2 B for the dimension lengths of the two tuples. If we use a regular stream array without support for variable tuple length, we have to align the first tuple, so it looks like ⟨1, 2, 0, 0⟩, and the two tuples occupy 8×4 B.</p>
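      <p>The storage accounting in the example above can be checked with a small sketch; the 4 B dimension size and the 1 B per-tuple length prefix follow the example, while the function names are ours:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed when each tuple stores its own dimension count:
// 4 B per dimension plus 1 B for the length prefix.
std::size_t variableSize(const std::vector<std::size_t>& dims) {
    std::size_t total = 0;
    for (std::size_t d : dims) total += 4 * d + 1;
    return total;
}

// Bytes needed by the fixed layout: every tuple is padded to the
// dimensionality of the longest tuple.
std::size_t fixedSize(const std::vector<std::size_t>& dims) {
    std::size_t maxDim = 0;
    for (std::size_t d : dims)
        if (d > maxDim) maxDim = d;
    return 4 * maxDim * dims.size();
}
```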
    </sec>
    <sec id="sec-2">
      <title>Fibonacci coding</title>
      <p>This kind of compression is based on the Fibonacci coding of numbers. Because each dimension of a tuple contains only a non-negative number, we can use Fibonacci coding.</p>
      <p>Example 4.2. Let us have the tuple ⟨1, 2, 3, 7⟩. After encoding, the tuple is stored as the sequence of bits 11011001101011, which occupies 2 B instead of the original 16 B (each dimension is 4 B long).</p>
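      <p>The bit sequence in the example above can be reproduced with a straightforward bit-at-a-time encoder, sketched below. Note that Fibonacci codes cover positive integers, so zero-valued dimensions would need, for example, an offset of one; that handling is an assumption, not part of the article:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Fibonacci (Zeckendorf) code of a positive integer n: one bit per Fibonacci
// number F1=1, F2=2, F3=3, F4=5, ... up to n, with a terminating 1-bit, so
// every codeword ends in "11".
std::string fibEncode(uint32_t n) {
    assert(n >= 1);                    // Fibonacci coding needs n > 0
    std::vector<uint32_t> fib = {1};   // Fibonacci numbers not exceeding n
    uint32_t a = 1, b = 2;
    while (b <= n) { fib.push_back(b); uint32_t t = a + b; a = b; b = t; }
    std::string bits(fib.size(), '0');
    // Greedy largest-first selection yields the Zeckendorf representation.
    for (std::size_t i = fib.size(); i-- > 0; )
        if (fib[i] <= n) { bits[i] = '1'; n -= fib[i]; }
    return bits + '1';                 // terminating 1-bit
}
```

      <p>Concatenating the codewords of 1, 2, 3 and 7 gives exactly the 14-bit sequence of the example.</p>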
      <p>
        The problem with this compression technique arises when a tuple contains large numbers: compressing the tuple then takes more time, because each number is encoded bit by bit. Due to this fact, we used the fast Fibonacci decompression algorithm, which is described in more detail in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This decompression algorithm is faster because it works with whole bytes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Compression based on reference item</title>
      <p>Tuples in a stream array are sorted, and we can use this feature to compress a tuple with knowledge of its predecessor.</p>
      <p>Common prefix compression Common prefix compression is based on the idea of Run Length Encoding (RLE). The predecessor of the tuple being compressed is usually very similar, and therefore we do not have to store every dimension.</p>
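      <p>The encoding step can be sketched as follows: for each tuple, count the leading dimensions shared with its predecessor and store only the remainder. The function name and the pair-based return shape are our own:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

using Tuple = std::vector<uint32_t>;

// Common prefix compression of one tuple against its predecessor: returns
// the number of common leading dimensions and the remaining dimensions.
std::pair<std::size_t, Tuple> prefixEncode(const Tuple& prev, const Tuple& cur) {
    std::size_t common = 0;
    while (common < prev.size() && common < cur.size()
           && prev[common] == cur[common])
        ++common;
    return {common, Tuple(cur.begin() + common, cur.end())};
}
```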
      <p>Example 4.3. Let us have these tuples: ⟨1, 2, 3, 7, 9, 7⟩, ⟨1, 2, 3, 7, 5, 6, 7⟩ and ⟨1, 2, 3, 7, 7, 0, 0, 7⟩. The first tuple in the block cannot be compressed, because there is no predecessor. The second tuple has to store only its last 3 dimensions and the third one only its last 4 dimensions. The result after compression looks like 0 − ⟨1, 2, 3, 7, 9, 7⟩, 4 − ⟨5, 6, 7⟩, 4 − ⟨7, 0, 0, 7⟩, where the first number says how many dimensions are common. In this example we saved 20 B (the original size is 21×4 B, the compressed size is (13+3)×4 B).</p>
      <p>Fibonacci coding with reference item The Fibonacci code is designed for small numbers. However, the numbers in the containment labeling scheme grow rapidly. In that case, the Fibonacci code becomes inefficient and the compression does not work appropriately. In order to keep the numbers small, we subtract from each tuple its previous tuple.</p>
      <p>Example 4.4. Let us have these two tuples: ⟨1000, 200, 300, 7⟩ and ⟨1005, 220, 100, 7⟩. From this example we see that we can subtract the first 2 dimensions. After the subtraction we have ⟨1000, 200, 300, 7⟩ and ⟨5, 20, 100, 7⟩, which are encoded faster and also occupy less space.</p>
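      <p>A sketch of this subtraction step follows. The rule of subtracting only the leading run of dimensions whose difference stays positive is our assumption about how the scheme remains decodable, since Fibonacci coding cannot represent zero or negative numbers:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

using Tuple = std::vector<uint32_t>;

// Reference-item transform: replace the leading dimensions of 'cur' by their
// (positive) differences from the previous tuple, stopping at the first
// dimension where the difference would be zero or negative.
Tuple deltaEncode(const Tuple& prev, const Tuple& cur) {
    Tuple out = cur;
    for (std::size_t i = 0; i < out.size() && i < prev.size(); ++i) {
        if (cur[i] <= prev[i]) break;   // stop: delta not positive
        out[i] = cur[i] - prev[i];
    }
    return out;
}
```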
    </sec>
    <sec id="sec-experiments">
      <title>Experimental results</title>
      <p>In our experiments, we used the XMARK data collection and generated labels for two different labeling schemes: the containment labeling scheme with a fixed label size and the Dewey order labeling scheme with variable dimension length. We tested the scalability of the compression schemes on different collection sizes. We provide tests for XMARK collections containing approximately 200k, 400k, 600k, 800k and 1000k labels. Each collection contains 512 streams.</p>
      <p>The stream array and all compression algorithms were implemented in C++. We created one persistent stream array for each collection of labels. We provide a set of tests in which we simulate real work with the stream array and measure the influence of the compression. For each test, we randomly selected 100 streams and read them until the end. The tests are processed with a cold cache. During the tests we measured the file size, the query time and the disk access cost (DAC). The query time includes the time needed to open each randomly selected stream and read it until the end. The DAC is equal to the number of disk accesses during query processing.</p>
      <sec id="sec-3-1">
        <title>Fixed dimension length</title>
        <p>In Figure 3(a) we can see that the file size is the same for blocks without compression and for blocks which support storing variable tuple dimensionality; there is a small difference, but it is caused only by the support for variable dimension length. As we can also see in Figure 3(a), regular Fibonacci compression saves about 25%. Due to the fact that the label values are very close to each other, we achieve significantly better results with Fibonacci compression using a reference tuple: this kind of compression saves about 50% compared to the regular stream array. Common prefix compression saves only about 10%.</p>
        <p>Even though the compression ratio is good, the query time for the compressed stream array is slightly worse than for the regular stream array, as we can see in Figure 3(b). The disk access cost saved by the compression is not sufficient in this case; it is outweighed by the time spent on decompression.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Variable dimension length</title>
        <p>If the collection contains tuples with variable dimension length, we can save from 55% (only by using blocks which support variable dimension length) up to 85% (for Fibonacci compression with a reference item) of the file size compared to the regular stream array.</p>
        <p>The query time of the compressed stream array is always smaller than the query time of the regular stream array, for every implemented compression technique, as we can see in Figure 4(b). Fibonacci compression has the best results for this data collection, with or without a reference tuple. The two variants are comparable because the label numbers do not grow as quickly in the case of the Dewey order labeling scheme.</p>
        <p>The experiments were executed on an Intel Celeron D 356 (3.33 GHz, 512 kB L2 cache) with 3 GB of 533 MHz DDR2 SDRAM, running Windows Vista. The XMARK collection is available at http://monetdb.cwi.nl/xml/.</p>
        <sec id="sec-3-2-1">
          <title>Fixed length tuple</title>
          <p>Variable length tuple
Fibonacci coding
Fibonacci coding with reference item</p>
          <p>Common prefix compression
0</p>
          <p>0
X200k</p>
          <p>X400k</p>
          <p>X600k</p>
          <p>X800k
In this article, we evaluated persistent stream array compression. The persistent stream array is designed to implement the stream ADT, which supports XML indexing approaches. We tested the two most common types of XML tree labeling schemes: the containment labeling scheme and the Dewey order labeling scheme. We performed a series of experiments with different compression techniques. Compression of the containment labeling scheme is feasible only if we want to decrease the size of the data file: the data decompression time is always higher than the time saved on the DAC, and therefore query processing using a compressed stream array is less efficient than using the regular stream array. On the other hand, a compressed stream array storing Dewey order labels performs significantly better than the regular stream array. The best query time is achieved with the compression technique utilizing fast Fibonacci coding.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Fixed length tuple</title>
          <p>Variable length tuple
Fibonacci coding
Fibonacci coding with reference item
Common prefix compression
]
B
k
[
ACD 0006
0
0
0
0
0
1
0
0
0
0
2
0
0
0
4
1
0
0
0
2
0
0
0
X200k</p>
          <p>X400k</p>
          <p>X600k</p>
          <p>X800k</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Al-Khalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          .
          <article-title>Structural Joins: A Primitive for Efficient XML Query Pattern Matching</article-title>
          .
          <source>In Proceedings of ICDE 2002. IEEE CS</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Snasel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Platos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kratky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>El-Qawasmeh</surname>
          </string-name>
          .
          <article-title>The Fast Fibonacci Decompression Algorithm</article-title>
          .
          <source>Arxiv preprint arXiv:0712.0811</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Bruno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          .
          <article-title>Holistic Twig Joins: Optimal XML Pattern Matching</article-title>
          .
          <source>In Proceedings of ACM SIGMOD</source>
          <year>2002</year>
          , pages
          <fpage>310</fpage>
          -
          <lpage>321</lpage>
          . ACM Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tatemura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Hsiung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Candan</surname>
          </string-name>
          . Twig2Stack:
          <article-title>Bottom-up Processing of Generalized-tree-pattern Queries Over XML documents</article-title>
          .
          <source>In Proceedings of VLDB 2006</source>
          , pages
          <fpage>283</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Korn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shanmugasundaram</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Index Structures for Matching XML Twigs Using Relational Query Processors</article-title>
          .
          <source>In Proceedings of ICDE 2005</source>
          , pages
          <fpage>1273</fpage>
          -
          <lpage>1273</lpage>
          . IEEE CS,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Comer</surname>
          </string-name>
          .
          <article-title>The Ubiquitous B-Tree</article-title>
          .
          <source>In ACM Computing Surveys</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>137</lpage>
          . ACM Press, June,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Database systems: the complete book</article-title>
          .
          <source>Prentice Hall</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T.</given-names>
            <surname>Grust</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Keulen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          . Staircase Join:
          <article-title>Teach a Relational DBMS to Watch Its (Axis) Steps</article-title>
          .
          <source>In Proceedings of VLDB 2003</source>
          , pages
          <fpage>524</fpage>
          -
          <lpage>535</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ooi.</surname>
          </string-name>
          XR-Tree:
          <article-title>Indexing XML Data for Efficient Structural Joins</article-title>
          .
          <source>In Proceedings of ICDE</source>
          ,
          <year>2003</year>
          , India. IEEE,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. I. Tatarinov et al.
          <article-title>Storing and Querying Ordered XML Using a Relational Database System</article-title>
          .
          <source>In Proceedings of ACM SIGMOD</source>
          <year>2002</year>
          , pages
          <fpage>204</fpage>
          -
          <lpage>215</lpage>
          , New York, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lohman</surname>
          </string-name>
          .
          <article-title>On Supporting Containment Queries in Relational Database Management Systems</article-title>
          .
          <source>In Proceedings of ACM SIGMOD</source>
          <year>2001</year>
          , pages
          <fpage>425</fpage>
          -
          <lpage>436</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>