Syllable-based Compression
 Syllable-based compression for
                            for XML
                                XML documents
                                    Documents

                Katsiaryna Chernik, Jan Lánský, and Leo Galamboš
                Katsiaryna Chernik, Jan Lánský, and Leo Galamboš
               Charles University, Faculty of Mathematics and Physics
               Charles University, Faculty of Mathematics and Physics
               Malostranske nam. 25, 118 00 Praha 1, Czech Republic
                        Department of Software Engineering
      kchernik@centrum.cz, zizelevak@gmail.com, leo.galambos@mff.cuni.cz
               Malostranské nám. 25, 118 00 Praha 1, Czech Republic
      kchernik@centrum.cz, zizelevak@gmail.com, leo.galambos@mff.cuni.cz

        Abstract. Syllable-based compression achieves sufficiently good results
        on text documents of a medium size. Since the majority of XML docu-
        ments are of that size, we suppose that the syllable-based method can
        give good results on XML documents, especially on documents that have
        a simple structure (small amount of elements and attributes) and rela-
        tively long character data content.
        In this paper we propose two syllable-based compression methods for
        XML documents. The first method, XMLSyl, replaces XML tokens (ele-
        ment tags and attributes) by special codes in input document and then
        compresses this document using a syllable-based method. The second
        method, XMillSyl, incorporates syllable-based compression into the ex-
        isting method for XML compression XMill. XMLSyl and XMillSyl are
        compared with a non-XML syllable-based method and with other exist-
        ing method for XML compression.


 1    Introduction
 The Extensible Markup Language (XML) [5] is a simple text format for struc-
 tured text documents. XML provides flexibility in storing, processing and ex-
 changing data on the Web. However, due to their verbosity, XML documents
 are usually larger in size than other exchange formats containing the same data
 content. One solution to this problem consists of compressing XML documents.
 Because XML is a text format, it is possible to compress XML documents with
 existing text compression methods. These methods are more effective, when
 XML documents have simple structure and long character data content. There
 are different types of text compression: text compression by characters and text
 compression by words. There is also a novel method: text compression by symbols
 [12]. In our work an application of this method to XML documents was devel-
 oped. Since single text compression is not able to discover and utilize the redun-
 dancy in the structure of XML, we modify syllable-based compression method
 for XML.
     At the beginning we supposed that XML syllable-based compression will
 be suitable for middle-sized textual XML documents. There are many XML
 documents that meet these conditions, for example any documentation written
 in DocBook [16] format or news in RSS format [18]. Moreover we suppose that
 our compression would be more suitable for documents in languages with rich
 morphology (for example Czech or German [12]).


V. Snášel, K. Richta, J. Pokorný (Eds.): Dateso 2006, pp. 21–31, ISBN 80-248-1025-5.
22      Katsiaryna Chernik, Jan Lánský, Leo Galamboš


2     Syllable-based compression method
Syllable-based compression [12] is the method where compression is performed
at the syllable level. There are two syllable-based compressors. The first one is
syllable-based LZW, and the second one is syllable-based Huffman.
    Algorithm LZW [11] is a dictionary compression character-based method.
The syllable-based version is called LZWL. In the initialization step, the syllable
dictionary is filled with empty syllable and syllables from a database of frequent
syllables. The following steps are similar with character-based version of LZW,
but LZWL works over an alphabet of syllables.
    The second syllable-based compression method is called HuffSyllable. It is a
statistical compression method based on the adaptive Huffman coding. For our
purposes, we use only LZWL syllable-based compression method. Adaptation of
HuffSyllable for XML compression gave worse results than LZWL.


3     XMLSyl
Our goal was to modify the syllable-compression method to compress XML doc-
uments efficiently. We attempted to modify existing syllable-based compressor
so, that it treats XML tokens (element tags and attributes) as single syllables
instead of decomposing them into many syllables. There were two possibilities
to compel the syllable-based compressor to treat XML tokens as syllables:
 1. Modify parser used in the syllable-based tool and combine it with an XML
    parser, so that it can recognize XML tokens and treat them as a single
    syllable.
 2. Replace XML tokens with bytes in the input document and then compress
    such a document with an existing syllable-based tool.
We decided to implement the second way because this implementation allows
us to make some future improvements easily. For example, we may compel the
syllable-based compressor to assign codes with minimal length to XML tokens
by adding this single bytes to the syllable dictionary[12]. This improvement
is impossible in the first variant. The encoding of XML tokens is inspired by
existing XML compression methods like XMLPPM [3], XGrind [6], XPress [9],
XMill [8].

3.1   Architecture and principles of XMLSyl
The architecture of XMLSyl is shown in Figure 1. It has four major modules: the
SAX Parser, the Structure Encoder, the Containers and the Syllable Compressor.
First, the XML document is sent to the SAX Parser. Next the parser decomposes
document into SAX events (start-tags, end-tags, data items, comments and etc.)
and forwards them to the Structure Encoder.
    The Structure Encoder encodes the SAX events and routes them to the
different Containers. There are three containers in our implementation:
                              Syllable-based Compression for XML Documents          23


        XML Document               SAX Parser

                                Structure Encoder


         Element Container     Attribute Container   Data and Structure Container


       Syllable Compressor      Syllable Compressor       Syllable Compressor


                             Compressed XML document


                  Fig. 1. The Architecture of XMLSylCompressor


 1. Element Container: The Element Container stores the names of all ele-
    ments that occur in an XML document. The Structure Encoder also uses
    the Element Container as the dictionary for encoding XML structure.
 2. Attribute Container: The Attribute Container stores the names of all
    attributes which occur in an XML document. The Structure Encoder also
    uses the Attribute Container as the dictionary for encoding XML structure.
 3. Structure and Data Container: The Structure and Data Container stores
    an XML document, in which all meta-data are replaced with special codes.
    The encoding process is presented in section 3.2.

   When a document is parsed and separated into the containers completely,
the contents of the containers are sent to the Syllable Compressor. It compresses
the content of each container separately using syllable-based compression and
sends the result to the output.
   We have not written the SAX parser by ourselves, rather we have used the
Expat parser[10] which is an open-source SAX parser written in C.


3.2   Encoding the structure of XML document

The structure of XML document is encoded in XMLSyl as follows. Whenever a
new element or attribute is encountered, its name is sent to the dictionary and the
index of the element is sent to the Data and Structure Container. Two different
dictionaries are used for attributes and elements: the Element Dictionary and
the Attribute Dictionary. The Attribute Container operates as the Attribute
Dictionary and the Element Container as the Element Dictionary. Whenever
an end tag is encountered a token END_TAG is sent to the Data and Structure
container. Whenever a character sequence is encountered, it is sent to the Data
and Structure Container without changes. Start and end of character sequences
are indicated by special tokens. We distinguish four different character sequences:
24      Katsiaryna Chernik, Jan Lánský, Leo Galamboš


value of attribute, value of element, comment, and white spaces between tags, if
white spaces are preserved.
   To illustrate the encoding process, consider the encoding of the following
small XML document:

          <book>
            <title lang="en">XML</title>
            <author>Brown</author>
            <author>Smith</author>
            <price currency="EURO">49</price>
          </book>
          <!-- Comment-->

First, the XML document is converted into a corresponding stream of SAX
events:

          startElement("book")
          startElement("title",("lang","en"))
          characters("XML")
          endElement("title")
          startElement("author")
          characters("Smith")
          endElement("author")
          startElement("author")
          characters("Brown")
          endElement("author")
          startElement("price","currency","EURO")
          characters("49")
          endElement("price")
          endElement("book")
          comment("Comment")

    The tokens in the SAX event stream are sent to the Structure Encoder.
It encodes them and sends them to their corresponding containers. When the
book start element token is encountered, the string book is sent to the Element
Container since this element name was not encountered before. An index E0 is
assigned to this entry. This index is sent to the Data and Structure Container.
The same operation is executed for title start element. String title is sent to The
Element Container and an index E1 is assigned to it. The index E1 is sent to
the Data and Structure Container. The element title has the attribute lang. The
attribute name is sent to the Attribute Container and the index A0 is assigned
to it. The index A0 is sent to the Data and Structure Container. Then attribute
value ”en” is sent without modification to the Data and Structure Container.
The ”en” attribute is followed by the token END_ATT, that signals the end of the
attribute value. When an element value such as ”XML” is encountered, the token
CHAR, signaling the beginning of character sequence, the data value and then the
token END_CHAR are all sent to the Data and Structure Container. Finally, all
                           Syllable-based Compression for XML Documents        25


the end tags are replaced by the token END_TAG. When a comment event is
encountered, the code CMNT is put into the Data and Structure Container. The
comment is also sent to the container and is enclosed by END_CMNT code. The
final state of all containers is shown in Figure 2.


                    Element Container
                    element index          Attribute Container
                    book      E0           attribute index
                    title     E1           lang        A0
                    author    E2           currency    A1
                    price     E3

                       Data and Structure Container
    <book> <title lang="en">          XML           </title> <author>
      E0     E1 A0 en END_ATT CHAR XML END_CHAR END_TAG          E2
       Brown       </author> <author>        Smith          </author> <price
CHAR Brown END_CHAR END_TAG     E2    CHAR Smith END_CHAR END_TAG       E3
 currency="EURO">       49        </price> </book>       <!--Comment-->
 A1 Euro END_ATT CHAR 49 END_CHAR END_TAG END_TAG CMNT Comment END_CMNT


                           Fig. 2. Content of containers

   In this example we have ignored white spaces between tags, e.g. <book> and
<title>, so the decompressor then produces a standard indentation. Optionally,
XMLSyl can preserve the white spaces. In that case, it stores the white spaces as
the sequence of characters in the Data and Structure Container between tokens
WS and END_WS.


3.3   Containers

The containers are the basic units for grouping XML data. The Attribute Con-
tainer holds attribute names and the Element Container holds element names.
As long as the number of all element and attribute names in any XML docu-
ment is not high, this two containers are kept in main memory. During parsing,
the containers size increases as the container is filled with entries. Each entry
in the Element container is assigned a byte in the range 00-A9. These bytes
are used for encoding the element names. Each entry in the Attribute container
is assigned a byte in the range AA-F9. These bytes are used for encoding the
attribute names. The residual 6 bytes are reserved for special codes like CHAR,
END_TAG etc. In most cases, 170 (or 80) bytes are enough to encode element (or
attribute) names. If the number of elements (or attributes) are greater than 170
(or 80), entries are encoded with two bytes, then tree and so on.
    There is another situation with The Data and Structure Container. We do not
know the size of the input XML document. The size of XML document can be
so big, that document will not fit into memory, and it is not possible to increase
26      Katsiaryna Chernik, Jan Lánský, Leo Galamboš


the size of container endlessly. Therefore, the container consists of two memory
block of constant size. The content of the first memory block is compressed, as
soon as the container is filled. We don’t compress two blocks at once, because
the context of the second memory block is used for compression of the first one.
After the compression, the compressed content of the first block is sent to the
output and the first block swaps its purpose with the second one. Now the first
block is filled with data. When it is full, the second block is compressed, and so
on.


3.4   The Syllable Compressor

The Syllable Compressor compresses the Structure and Data Container first
and sends the output to the output file. Then the Attribute Containers are
compressed and sent to the output file and finally the same happens with the
Element Container. LZWL is used for the compression of data. HuffSyll could
be also chosen, but the performance is worse, so we decided to use only LZWL.


4     XMillSyl

This chapter introduces our second syllable-based XML compressor, XMillSyl.
This second method incorporates syllable-based compression with the existing
method for XML compression of XMill [8]. XMill has two main principles in
order to optimize XML compression:

 – separating structure from data content, and
 – grouping Data values with related semantics in the same ”container”.

Each data container is then compressed individually with gzip [21]. In XMillSyl,
containers are compressed with LZWL.
   We do not suppose that XMillSyl method gives better results than XMill
because gzip compression performs better than LZWL. We have implemented
XMillSyl in order to compare the power of XMLSyl with the power of two main
principles of XMill.


4.1   Implementation

We did not write XMill compressor. We decided to use existing sources of XMill.
   XMill operates as follows: a SAX parser parses the XML file and the SAX
events are sent to the core module of the XMill called the path processor. It
determines how to map tokens to containers: element tag names and attribute
names are encoded and sent to the structure container, while the data values
are sent to various data containers, according to their semantic. Finally, the
containers are gzipped independently and stored on disk.
   We have modified compression and decompression functions (operating on
containers) in the way they compress and decompress the data containers with
                                       Syllable-based Compression for XML Documents                            27


                 Input XML file                   SAX Parser


                                                Path Processor


      Structure Container
                                         Small
                                    Data Container
                                                               Large
                                                          Data Container 1      … Data Container
                                                                                       Large
                                                                                                 k

                GZip                      GZip                   LZWL                          LZWL

                                             Compressed XML file


                                     Fig. 3. Architecture of XMillSyl


the syllable-based method (see Figure 3). Moreover we have modified the syllable-
based method so that it can work with the containers of XMill implementation
instead of a file stream.
    XMillSyl discerns the difference between small and large containers. Since
LZWL is not suitable for extremely small data, the small containers are com-
pressed with gzip. The structure container is also gzipped in XMillSyl. The large
containers are compressed with LZWL.


                                        Table 1. The first data set.

                    Size      Lang      Description
      elts             103919 English Periodic table of the elements in XML
      pcc           2600257 English Formal proofs transformed to XML
      stats            869059 English One year statistics if baseball players
      tal           1364576 English Safe-annotated assembly language converted to XML
      tpc              313193 English The XML representation of the TPC_D benchmark database.


                    Size      Lan       Description
      errors           153530 English "The Comedy of Errors" marked up as XML
      hamlet           314677 English "The Tragedy of Hamlet, Prince of Denmark" marked up as XML
      antony           289865 English "The Tragedy of Antony and Cleopatra" marked up as XML
      much_ado         220495 English "Much Ado about Nothing" marked up as XML
5    Comparison
     ch00          Experiments
            13916 English "DocBook: The Definitive Guide" in DocBook format (1)
      ch01              55015 English "DocBook: The Definitive Guide" in DocBook format (2)
      ch02             160728 English "DocBook: The Definitive Guide" in DocBook format (3)
To show
     ch03 the effectiveness
                 27799 English of XMLSyll
                               "DocBook:        and Guide"
                                         The Definitive  XMillSyl,     weformat
                                                               in DocBook    compared
                                                                                 (4)   the perfor-
mancech04       137440compressors
       of this two     English "DocBook:with
                                         The Definitive Guide" in DocBook format
                                               one representative            of (6)
                                                                                  XML compressors
XMillch05        67142 English "DocBook: The Definitive Guide" in DocBook format (7)
      and the syllable-based      compressor LZWL [12].
      glossary          24701 English "DocBook: The Definitive Guide" in DocBook format (8)
      howto             42853 English "DocBook V5.0, Transition Guide" in DocBook format.
      hledani           16429 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (1)
      komunikace        50881 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (2)
      navihace          18495 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (3)
      robot             25405 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (4)
      xml               28467 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (5)
      rur1              59609 Czech     "R.U.R" marked up as XML.


                                        V set2 Murkup menshe chem 50% I harakter dannych tekstovyj=>
                                        pokazyvajet horoshije rezultaty.
28          Katsiaryna Chernik, Jan Lánský, Leo Galamboš


5.1    XML data sources
XMLSyl and XMillSyl were tested on two data sets that cover a wide range
of XML data formats and structures. The first data set is shown in Table 1.
It contains English XML documents with different inner structure. It includes
regular data that has regular markup and short character data content (elts,
stats, weblog, tpc). It also includes irregular data, that has irregular markup
(pcc, tall).
    The second data set is shown in Table 2. It contains textual XML documents
of simple structure with long character data content. It contains five stage plays
marked up as XML, four in English and one in Czech. It also contains data in
DocBook format in Czech and in English.
    Some dataSize was distributed
                         Lang           with the XMLPPM [3] and the Exalt [4] com-
                                 Description
pressors
      elts while others   were found
                  103919 English           on of
                                 Periodic table Internet    [15],
                                                  the elements      [16]. All Czech documents use
                                                               in XML
Windows-1250
      pcc         encoding.
                 2600257 English Formal proofs transformed to XML
      stats         869059 English One year statistics if baseball players
      tal          1364576 English Safe-annotated assembly language converted to XML
      tpc           313193 EnglishTable
                                   The XML2. The second
                                           representation of thedata set.
                                                                 TPC_D benchmark database.


                   Size    Lan       Description
      errors        153530 English "The Comedy of Errors" marked up as XML
      hamlet        314677 English "The Tragedy of Hamlet, Prince of Denmark" marked up as XML
      antony        289865 English "The Tragedy of Antony and Cleopatra" marked up as XML
      much_ado      220495 English "Much Ado about Nothing" marked up as XML
      ch00           13916 English "DocBook: The Definitive Guide" in DocBook format (1)
      ch01           55015 English "DocBook: The Definitive Guide" in DocBook format (2)
      ch02          160728 English "DocBook: The Definitive Guide" in DocBook format (3)
      ch03           27799 English "DocBook: The Definitive Guide" in DocBook format (4)
      ch04          137440 English "DocBook: The Definitive Guide" in DocBook format (6)
      ch05           67142 English "DocBook: The Definitive Guide" in DocBook format (7)
      glossary       24701 English "DocBook: The Definitive Guide" in DocBook format (8)
      howto          42853 English "DocBook V5.0, Transition Guide" in DocBook format.
      hledani        16429 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (1)
      komunikace     50881 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (2)
      navihace       18495 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (3)
      robot          25405 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (4)
      xml            28467 Czech     "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (5)
      rur1           59609 Czech     "R.U.R" marked up as XML.


                                     V set2 Murkup menshe chem 50% I harakter dannych tekstovyj=>
                                     pokazyvajet horoshije rezultaty.


5.2    Compression Performance Metrics
The compression ratio is defined as follows:
                                 sizeof (compressed f ile) × 8
                      CR =                                     bits/byte.
                                     sizeof (original f ile)
We compare XMillSyl and XMLSyl compression ratios with those of XMill.
The compression ratio factor shows normalization of the compression ratio of
                                    Syllable-based Compression for XML Documents                 29


                                    Table 3. The first data set.

                           CRLZWL CRXmill CRXMillSyl CRFXMillSyll CRXMLSyl CRFXMLSyl               1
1             elts            1,04  0,47         0,54       1,15      0,72     1,53
2             pcc             0,22  0,02         0,03       1,50      0,04     2,00
3             stats           0,67  0,33         0,40       1,21      0,39     1,18
4             tal             0,36  0,09         0,12       1,33      0,15     1,67
5             tpc             1,82  1,05         1,54       1,47      1,60     1,52
              Average         0,82  0,39         0,53       1,33      0,58     1,58

                           CRLZWL CRXmill CRXMillSyl CRFXMillSyll CRXMLSyl CRFXMLSyl
 1           errors            1,98    1,83       2,00       1,09     1,83       1,00
 2   XMillSyll   or XMLSyl 1,96
             hamlet             with respect
                                       1,91    to 2,00
                                                   XMill. The1,05compression
                                                                      1,85    ratio
                                                                                 0,97factor is
 3   defined antony
             as follows:       1,84    1,79       1,88       1,05     1,69       0,94
 4           much_ado          1,88    1,80          CRXSyl1,05
                                                  1,89                1,77       0,98
                                      CRF        =            .
 5           ch00              3,28    2,69 XSyl 3,00CRXM ill1,12     2,88       1,07
 6           ch01              2,69    2,20       2,43       1,10     2,46       1,12
 7           ch02              1,76    1,43       1,70       1,19     1,57       1,10
 8           ch03              2,90    1,87       2,70       1,44     2,08       1,11
 9           ch04              2,09    1,66       1,78       1,07     1,83       1,10
10           ch05
     5.3 Experimental          2,28
                               Results 1,81       2,03       1,12     2,04       1,13
11           glossary          2,07    1,64       1,84       1,12     1,89       1,15
12   The compression
             howto        ratio6,69
                                 statistics
                                       2,30 of two2,50
                                                    sets of XML
                                                             1,09 documents
                                                                      2,59     are
                                                                                 1,13shown in
13            and Table 4. 3,79
     Table 3 hledani                   3,13       3,62       1,16     3,40       1,09
14       The komunikace
              syllable-based 3,25
                               method2,65performed2,93
                                                     worse on1,11     3,01from the
                                                              documents          1,14first data
15           navihace          3,79    3,14       3,68       1,17     3,44
     set. On the other hand, both XMLSyl and XMillSyl shows great improvement    1,10
16           robot             3,43    2,86       3,22       1,13     3,04       1,06
17
     comparing
             xml
                  to LZWL. 3,74They compressed
                                       3,23
                                                    the input1,14
                                                  3,69
                                                               to 50-60%
                                                                      3,30
                                                                           of the1,02
                                                                                   size of the
18   compressed
             rur1  file with LZWL.
                               2,33    2,07       2,37       1,14     2,15       1,04
         On XML
             Average documents 2,88of the
                                       2,22second 2,51
                                                   data set, 1,13
                                                             LZWL provides
                                                                      2,38      a1,07
                                                                                  reasonably
     good compression ratio - on the average, about two-thirds that of XMill. This
     confirmschour prediction,
                            1,84that 1,61         1,78 compression
                                       syllable-based         1,11      is1,70
                                                                            effective1,06
                                                                                      for textual1,11
             books          1,71      1,79        1,75        0,98        1,66       0,93
     XML documents. Moreover our compression methods show even greater im-
             ch+books       1,80      1,74        1,76        1,01        1,72       0,99
     provement.             3,13      2,63        2,81        1,07        2,93       1,11 0,935943
        On the document of     the second
                            2,83      2,32    data   set, XMillSyl
                                                  2,51        1,08 achieves
                                                                          2,60 about 1,1215%   and
                                                                                            0,924303
     XMLSyl is about 20% better
                            2,78     compression
                                      2,28           ratio than
                                                  2,47           LZWL.2,57
                                                              1,08         Compared  1,13to 0,923077
                                                                                            XMill,
     both methods perform 2,58
                            slightly2,14          2,30
                                        worse. XMillSyl       1,07
                                                           compresses     2,40 13%
                                                                         about       1,12
                                                                                       and0,930435
                                                                                             XML-
                            2,49
     Syl about 7% worse than XMill.   2,15        2,32        1,08        2,34       1,09   0,926724
                            2,40      2,07        2,22        1,07        2,25       1,09 0,932432
        Figure 4 shows the2,30
                             variation1,97
                                          of the 2,17
                                                  compression 1,10
                                                                  ratio as   a function
                                                                          2,15
                                                                                          of XML
                                                                                     1,09 0,907834
     data size for ”DocBook:2,21 The1,90Definitive2,08Guide”.The
                                                              1,09 compression
                                                                          2,06       was run
                                                                                     1,08        on
                                                                                            0,913462
     several subsets. On small
                            2,17 files1,89
                                        XMillSyl2,10
                                                   performs 1,11
                                                              better than 2,03XMLSyl.1,07 The ex- 0,9
     planation is, that the 2,07
                             data are 1,80split into
                                                  2,01 many 1,12          1,93
                                                              small containers       1,07
                                                                                    in      0,895522
                                                                                        XMillSyl,
     which are compressed 1,98
                             with gzip1,73 (gzip 1,93
                                                  outperforms 1,12LZWL,   1,84       1,06on0,896373
                                                                             especially       small
                            1,92      1,68        1,88        1,12        1,79       1,07 0,893617
     data). On middle-sized1,89and large
                                      1,65
                                            files XMLSyl
                                                  1,85
                                                             outperforms
                                                              1,12        1,76
                                                                              XMillSyl.   We    can
                                                                                     1,07 0,891892
     observe that the bigger1,88
                              size also
                                      1,64implies1,83
                                                    a better compression.
                                                              1,12        1,74       1,06 0,896175


     6    Conclusion
     In this work we introduced syllable-based compression tools for XML documents
     called XMLSyl and XMillSyl. We presented the architecture and implementation
     30              CRLZWL CR
          Katsiaryna Chernik, Jan     CRXMillSyl
                                   Lánský,
                                Xmill            CRFXMillSyll CRXMLSyl CRFXMLSyl
                                             Leo Galamboš                                                                                                                                                                                      1
1         elts                                                              1,04   0,47     0,54         1,15                                                                                          0,72                    1,53
2         pcc                                                               0,22   0,02     0,03         1,50                                                                                          0,04                    2,00
3         stats                                                             0,67   0,33     0,40         1,21                                                                                          0,39                    1,18
4         tal                                                               0,36   0,09     0,12         1,33                                                                                          0,15                    1,67
5         tpc                                                               1,82   1,05     1,54         1,47                                                                                          1,60                    1,52
                                                                                 Table 4. The first data set.
          Average                                                           0,82   0,39     0,53         1,33                                                                                          0,58                    1,58

                                                                 CRLZWL CRXmill CRXMillSyl CRFXMillSyll CRXMLSyl CRFXMLSyl
 1        errors                                                    1,98  1,83         2,00       1,09      1,83     1,00
 2        hamlet                                                    1,96  1,91         2,00       1,05      1,85     0,97
 3        antony                                                    1,84  1,79         1,88       1,05      1,69     0,94
 4        much_ado                                                  1,88  1,80         1,89       1,05      1,77     0,98
 5        ch00                                                      3,28  2,69         3,00       1,12      2,88     1,07
 6        ch01                                                      2,69  2,20         2,43       1,10      2,46     1,12
 7        ch02                                                      1,76  1,43         1,70       1,19      1,57     1,10
 8        ch03                                                      2,90  1,87         2,70       1,44      2,08     1,11
 9        ch04                                                      2,09  1,66         1,78       1,07      1,83     1,10
10        ch05                                                      2,28  1,81         2,03       1,12      2,04     1,13
11        glossary                                                  2,07  1,64         1,84       1,12      1,89     1,15
12        howto                                                     6,69  2,30         2,50       1,09      2,59     1,13
13        hledani                                                   3,79  3,13         3,62       1,16      3,40     1,09
14        komunikace                                                3,25  2,65         2,93       1,11      3,01     1,14
15        navihace                                                  3,79  3,14         3,68       1,17      3,44     1,10
16        robot                                                     3,43  2,86         3,22       1,13      3,04     1,06
17        xml                                                       3,74  3,23         3,69       1,14      3,30     1,02
18        rur1                                                      2,33  2,07         2,37       1,14      2,15     1,04
          Average                                                   2,88  2,22         2,51       1,13      2,38     1,07

          ch                                                                1,84                   1,61                              1,78                                1,11                          1,70                    1,06           1,11
          books                                                             1,71                   1,79                              1,75                                0,98                          1,66                    0,93
          ch+books                                                          1,80                   1,74                              1,76                                1,01                          1,72                    0,99
                                                                            3,13                   2,63                              2,81                                1,07                          2,93                    1,11       0,935943
                                                                            2,83                   2,32                              2,51                                1,08                          2,60                    1,12       0,924303
                                         3,25
                                         3,20                               2,78                   2,28                              2,47                                1,08                          2,57                    1,13       0,923077
                                         3,15
                                         3,10
                                         3,05
                                                                            2,58                   2,14                              2,30                                1,07                          2,40                    1,12       0,930435
                                                                            2,49                   2,15                              2,32                                1,08                          2,34                    1,09       0,926724
          Compression Ratio (Bit/Byte)


                                         3,00
                                         2,95
                                         2,90
                                         2,85                               2,40                   2,07                              2,22                                1,07                          2,25                    1,09       0,932432
                                         2,80
                                         2,75
                                         2,70
                                                                            2,30                   1,97                              2,17                                1,10                          2,15                    1,09       0,907834
                                         2,65
                                         2,60                               2,21                   1,90                              2,08                                1,09                          2,06                    1,08       0,913462
                                         2,55
                                         2,50
                                         2,45
                                                                            2,17                   1,89                              2,10                                1,11                          2,03                    1,07
                                                                                                                                                                                                                               LZWL            0,9
                                         2,40
                                         2,35
                                                                            2,07                   1,80                              2,01                                1,12                          1,93                    1,07
                                                                                                                                                                                                                               XMillSyl   0,895522
                                         2,30
                                         2,25                               1,98                   1,73                              1,93                                1,12                          1,84                    1,06
                                                                                                                                                                                                                               XMLSyl     0,896373
                                         2,20
                                         2,15
                                         2,10
                                                                            1,92                   1,68                              1,88                                1,12                          1,79                    1,07
                                                                                                                                                                                                                               XMill      0,893617
                                         2,05
                                         2,00                               1,89                   1,65                              1,85                                1,12                          1,76                    1,07       0,891892
                                         1,95
                                         1,90
                                         1,85
                                                                            1,88                   1,64                              1,83                                1,12                          1,74                    1,06       0,896175
                                         1,80
                                         1,75
                                         1,70
                                         1,65
                                         1,60
                                         1,55
                                         1,50
                                                16149   27481   36483   58149   86991   112592   145116   168453   201423   217322   232569   254382   281915   303018   328481   355426   385967   399688   424093   446610


                                                                                                                    Size (byte)


                                                                Fig. 4. Compression ratio under different sizes.
                           Syllable-based Compression for XML Documents        31


of our tools and tested their performance on a variety of XML documents. In our
experiments, XMLSyl and XMillSyl were compared with LZWL and XMill. Both
methods are more suitable for textual XML documents. XMill outperformed
our methods only marginally. XMLSyl performs better than XMillSyl. It implies
that in our case encoding of XML structure is more efficient than separating a
structure from data and grouping data values with related meaning. XMillSyl
and XMLSyl show better results for Czech language.
    In the future, we want implement some modifications to enhance the com-
pression ratio. For example, the information in the DTD section can be extracted
and utilized to create a special syllable dictionary for elements and attributes.

References
 1. Wilfred Ng, Lam Wai, Yeung James Cheng. Comparative Analysis of XML Com-
    pression Technologies. World Wide Web Journal, 2005
 2. Smitha S. Nair. XML Compression Techniques: A Survey.
    www.cs.uiowa.edu/~rlawrenc/research/Students/SN_04_XMLCompress.pdf
 3. J. Cheney. Compressing XML with Multiplexed Hierarchical PPM Models In
    Proc. Data Compression Conference, 2001.
 4. V. Toman. Compression of XML Data. MFF UK, 2003
 5. World Wide Web Consorcium. Extensive Markup Language (XML) 1.0.
    http://www.w3.org/XML/
 6. P. Tolani, J. R. Haritsa. XGrind: A Query-friendly XML Compressor. In Proc.
    IEEE International Conference on Data Engineering, 2002.
 7. SAX: A Simple API for XML.
    http://www.saxproject.org
 8. H. Liefke, D. Suciu. XMill: an Efficient Compressor for XML Data. In Proc. ACM
    SIGMOD Conference, 2000.
 9. Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung, XPRESS: A Queriable Com-
    pression for XML Data SIGMOD 2003, June 912, 2003, San Diego, CA, 2000.
10. Expat XML Parser.
    http://expat.sourceforge.net
11. T. A. Welch. A technique for high performance data compression. IEEE Computer,
    1984.
12. J. Lansky, M. Zemlicka. Text Compression: Syllables. DATESO, 2005
13. J. Lansky, Slabiková komprese. MFF UK, 2005
14. V. Toman. Komprese XML dat.
    http://kocour.ms.mff.cuni.cz/~mlynkova/prg036/
15. J. Kosek. Inteligentnı́ podpora navigace na WWW s využitı́m XML.
    http://www.kosek.cz/diplomka/, 2002
16. DocBook http://www.docbook.org/
17. A Quick Introduction to XML.
    http://www.cellml.org/tutorial/xml_guide
18. M. Pilgrim. What Is RSS.
    http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html
19. XML Processing.
    http://diveintopython.org/xml_processing/
20. SAX And DOM Overview.
    http://www.jezuk.co.uk/cgi-bin/view/arabica/SAXandDOMIntro
21. The gzip home page.
    http://www.gzip.org/