Syllable-based Compression Syllable-based compression for for XML XML documents Documents Katsiaryna Chernik, Jan Lánský, and Leo Galamboš Katsiaryna Chernik, Jan Lánský, and Leo Galamboš Charles University, Faculty of Mathematics and Physics Charles University, Faculty of Mathematics and Physics Malostranske nam. 25, 118 00 Praha 1, Czech Republic Department of Software Engineering kchernik@centrum.cz, zizelevak@gmail.com, leo.galambos@mff.cuni.cz Malostranské nám. 25, 118 00 Praha 1, Czech Republic kchernik@centrum.cz, zizelevak@gmail.com, leo.galambos@mff.cuni.cz Abstract. Syllable-based compression achieves sufficiently good results on text documents of a medium size. Since the majority of XML docu- ments are of that size, we suppose that the syllable-based method can give good results on XML documents, especially on documents that have a simple structure (small amount of elements and attributes) and rela- tively long character data content. In this paper we propose two syllable-based compression methods for XML documents. The first method, XMLSyl, replaces XML tokens (ele- ment tags and attributes) by special codes in input document and then compresses this document using a syllable-based method. The second method, XMillSyl, incorporates syllable-based compression into the ex- isting method for XML compression XMill. XMLSyl and XMillSyl are compared with a non-XML syllable-based method and with other exist- ing method for XML compression. 1 Introduction The Extensible Markup Language (XML) [5] is a simple text format for struc- tured text documents. XML provides flexibility in storing, processing and ex- changing data on the Web. However, due to their verbosity, XML documents are usually larger in size than other exchange formats containing the same data content. One solution to this problem consists of compressing XML documents. Because XML is a text format, it is possible to compress XML documents with existing text compression methods. These methods are more effective, when XML documents have simple structure and long character data content. There are different types of text compression: text compression by characters and text compression by words. There is also a novel method: text compression by symbols [12]. In our work an application of this method to XML documents was devel- oped. Since single text compression is not able to discover and utilize the redun- dancy in the structure of XML, we modify syllable-based compression method for XML. At the beginning we supposed that XML syllable-based compression will be suitable for middle-sized textual XML documents. There are many XML documents that meet these conditions, for example any documentation written in DocBook [16] format or news in RSS format [18]. Moreover we suppose that our compression would be more suitable for documents in languages with rich morphology (for example Czech or German [12]). V. Snášel, K. Richta, J. Pokorný (Eds.): Dateso 2006, pp. 21–31, ISBN 80-248-1025-5. 22 Katsiaryna Chernik, Jan Lánský, Leo Galamboš 2 Syllable-based compression method Syllable-based compression [12] is the method where compression is performed at the syllable level. There are two syllable-based compressors. The first one is syllable-based LZW, and the second one is syllable-based Huffman. Algorithm LZW [11] is a dictionary compression character-based method. The syllable-based version is called LZWL. In the initialization step, the syllable dictionary is filled with empty syllable and syllables from a database of frequent syllables. The following steps are similar with character-based version of LZW, but LZWL works over an alphabet of syllables. The second syllable-based compression method is called HuffSyllable. It is a statistical compression method based on the adaptive Huffman coding. For our purposes, we use only LZWL syllable-based compression method. Adaptation of HuffSyllable for XML compression gave worse results than LZWL. 3 XMLSyl Our goal was to modify the syllable-compression method to compress XML doc- uments efficiently. We attempted to modify existing syllable-based compressor so, that it treats XML tokens (element tags and attributes) as single syllables instead of decomposing them into many syllables. There were two possibilities to compel the syllable-based compressor to treat XML tokens as syllables: 1. Modify parser used in the syllable-based tool and combine it with an XML parser, so that it can recognize XML tokens and treat them as a single syllable. 2. Replace XML tokens with bytes in the input document and then compress such a document with an existing syllable-based tool. We decided to implement the second way because this implementation allows us to make some future improvements easily. For example, we may compel the syllable-based compressor to assign codes with minimal length to XML tokens by adding this single bytes to the syllable dictionary[12]. This improvement is impossible in the first variant. The encoding of XML tokens is inspired by existing XML compression methods like XMLPPM [3], XGrind [6], XPress [9], XMill [8]. 3.1 Architecture and principles of XMLSyl The architecture of XMLSyl is shown in Figure 1. It has four major modules: the SAX Parser, the Structure Encoder, the Containers and the Syllable Compressor. First, the XML document is sent to the SAX Parser. Next the parser decomposes document into SAX events (start-tags, end-tags, data items, comments and etc.) and forwards them to the Structure Encoder. The Structure Encoder encodes the SAX events and routes them to the different Containers. There are three containers in our implementation: Syllable-based Compression for XML Documents 23 XML Document SAX Parser Structure Encoder Element Container Attribute Container Data and Structure Container Syllable Compressor Syllable Compressor Syllable Compressor Compressed XML document Fig. 1. The Architecture of XMLSylCompressor 1. Element Container: The Element Container stores the names of all ele- ments that occur in an XML document. The Structure Encoder also uses the Element Container as the dictionary for encoding XML structure. 2. Attribute Container: The Attribute Container stores the names of all attributes which occur in an XML document. The Structure Encoder also uses the Attribute Container as the dictionary for encoding XML structure. 3. Structure and Data Container: The Structure and Data Container stores an XML document, in which all meta-data are replaced with special codes. The encoding process is presented in section 3.2. When a document is parsed and separated into the containers completely, the contents of the containers are sent to the Syllable Compressor. It compresses the content of each container separately using syllable-based compression and sends the result to the output. We have not written the SAX parser by ourselves, rather we have used the Expat parser[10] which is an open-source SAX parser written in C. 3.2 Encoding the structure of XML document The structure of XML document is encoded in XMLSyl as follows. Whenever a new element or attribute is encountered, its name is sent to the dictionary and the index of the element is sent to the Data and Structure Container. Two different dictionaries are used for attributes and elements: the Element Dictionary and the Attribute Dictionary. The Attribute Container operates as the Attribute Dictionary and the Element Container as the Element Dictionary. Whenever an end tag is encountered a token END_TAG is sent to the Data and Structure container. Whenever a character sequence is encountered, it is sent to the Data and Structure Container without changes. Start and end of character sequences are indicated by special tokens. We distinguish four different character sequences: 24 Katsiaryna Chernik, Jan Lánský, Leo Galamboš value of attribute, value of element, comment, and white spaces between tags, if white spaces are preserved. To illustrate the encoding process, consider the encoding of the following small XML document: XML Brown Smith 49 First, the XML document is converted into a corresponding stream of SAX events: startElement("book") startElement("title",("lang","en")) characters("XML") endElement("title") startElement("author") characters("Smith") endElement("author") startElement("author") characters("Brown") endElement("author") startElement("price","currency","EURO") characters("49") endElement("price") endElement("book") comment("Comment") The tokens in the SAX event stream are sent to the Structure Encoder. It encodes them and sends them to their corresponding containers. When the book start element token is encountered, the string book is sent to the Element Container since this element name was not encountered before. An index E0 is assigned to this entry. This index is sent to the Data and Structure Container. The same operation is executed for title start element. String title is sent to The Element Container and an index E1 is assigned to it. The index E1 is sent to the Data and Structure Container. The element title has the attribute lang. The attribute name is sent to the Attribute Container and the index A0 is assigned to it. The index A0 is sent to the Data and Structure Container. Then attribute value ”en” is sent without modification to the Data and Structure Container. The ”en” attribute is followed by the token END_ATT, that signals the end of the attribute value. When an element value such as ”XML” is encountered, the token CHAR, signaling the beginning of character sequence, the data value and then the token END_CHAR are all sent to the Data and Structure Container. Finally, all Syllable-based Compression for XML Documents 25 the end tags are replaced by the token END_TAG. When a comment event is encountered, the code CMNT is put into the Data and Structure Container. The comment is also sent to the container and is enclosed by END_CMNT code. The final state of all containers is shown in Figure 2. Element Container element index Attribute Container book E0 attribute index title E1 lang A0 author E2 currency A1 price E3 Data and Structure Container XML E0 E1 A0 en END_ATT CHAR XML END_CHAR END_TAG E2 Brown Smith 49 A1 Euro END_ATT CHAR 49 END_CHAR END_TAG END_TAG CMNT Comment END_CMNT Fig. 2. Content of containers In this example we have ignored white spaces between tags, e.g. and , so the decompressor then produces a standard indentation. Optionally, XMLSyl can preserve the white spaces. In that case, it stores the white spaces as the sequence of characters in the Data and Structure Container between tokens WS and END_WS. 3.3 Containers The containers are the basic units for grouping XML data. The Attribute Con- tainer holds attribute names and the Element Container holds element names. As long as the number of all element and attribute names in any XML docu- ment is not high, this two containers are kept in main memory. During parsing, the containers size increases as the container is filled with entries. Each entry in the Element container is assigned a byte in the range 00-A9. These bytes are used for encoding the element names. Each entry in the Attribute container is assigned a byte in the range AA-F9. These bytes are used for encoding the attribute names. The residual 6 bytes are reserved for special codes like CHAR, END_TAG etc. In most cases, 170 (or 80) bytes are enough to encode element (or attribute) names. If the number of elements (or attributes) are greater than 170 (or 80), entries are encoded with two bytes, then tree and so on. There is another situation with The Data and Structure Container. We do not know the size of the input XML document. The size of XML document can be so big, that document will not fit into memory, and it is not possible to increase 26 Katsiaryna Chernik, Jan Lánský, Leo Galamboš the size of container endlessly. Therefore, the container consists of two memory block of constant size. The content of the first memory block is compressed, as soon as the container is filled. We don’t compress two blocks at once, because the context of the second memory block is used for compression of the first one. After the compression, the compressed content of the first block is sent to the output and the first block swaps its purpose with the second one. Now the first block is filled with data. When it is full, the second block is compressed, and so on. 3.4 The Syllable Compressor The Syllable Compressor compresses the Structure and Data Container first and sends the output to the output file. Then the Attribute Containers are compressed and sent to the output file and finally the same happens with the Element Container. LZWL is used for the compression of data. HuffSyll could be also chosen, but the performance is worse, so we decided to use only LZWL. 4 XMillSyl This chapter introduces our second syllable-based XML compressor, XMillSyl. This second method incorporates syllable-based compression with the existing method for XML compression of XMill [8]. XMill has two main principles in order to optimize XML compression: – separating structure from data content, and – grouping Data values with related semantics in the same ”container”. Each data container is then compressed individually with gzip [21]. In XMillSyl, containers are compressed with LZWL. We do not suppose that XMillSyl method gives better results than XMill because gzip compression performs better than LZWL. We have implemented XMillSyl in order to compare the power of XMLSyl with the power of two main principles of XMill. 4.1 Implementation We did not write XMill compressor. We decided to use existing sources of XMill. XMill operates as follows: a SAX parser parses the XML file and the SAX events are sent to the core module of the XMill called the path processor. It determines how to map tokens to containers: element tag names and attribute names are encoded and sent to the structure container, while the data values are sent to various data containers, according to their semantic. Finally, the containers are gzipped independently and stored on disk. We have modified compression and decompression functions (operating on containers) in the way they compress and decompress the data containers with Syllable-based Compression for XML Documents 27 Input XML file SAX Parser Path Processor Structure Container Small Data Container Large Data Container 1 … Data Container Large k GZip GZip LZWL LZWL Compressed XML file Fig. 3. Architecture of XMillSyl the syllable-based method (see Figure 3). Moreover we have modified the syllable- based method so that it can work with the containers of XMill implementation instead of a file stream. XMillSyl discerns the difference between small and large containers. Since LZWL is not suitable for extremely small data, the small containers are com- pressed with gzip. The structure container is also gzipped in XMillSyl. The large containers are compressed with LZWL. Table 1. The first data set. Size Lang Description elts 103919 English Periodic table of the elements in XML pcc 2600257 English Formal proofs transformed to XML stats 869059 English One year statistics if baseball players tal 1364576 English Safe-annotated assembly language converted to XML tpc 313193 English The XML representation of the TPC_D benchmark database. Size Lan Description errors 153530 English "The Comedy of Errors" marked up as XML hamlet 314677 English "The Tragedy of Hamlet, Prince of Denmark" marked up as XML antony 289865 English "The Tragedy of Antony and Cleopatra" marked up as XML much_ado 220495 English "Much Ado about Nothing" marked up as XML 5 Comparison ch00 Experiments 13916 English "DocBook: The Definitive Guide" in DocBook format (1) ch01 55015 English "DocBook: The Definitive Guide" in DocBook format (2) ch02 160728 English "DocBook: The Definitive Guide" in DocBook format (3) To show ch03 the effectiveness 27799 English of XMLSyll "DocBook: and Guide" The Definitive XMillSyl, weformat in DocBook compared (4) the perfor- mancech04 137440compressors of this two English "DocBook:with The Definitive Guide" in DocBook format one representative of (6) XML compressors XMillch05 67142 English "DocBook: The Definitive Guide" in DocBook format (7) and the syllable-based compressor LZWL [12]. glossary 24701 English "DocBook: The Definitive Guide" in DocBook format (8) howto 42853 English "DocBook V5.0, Transition Guide" in DocBook format. hledani 16429 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (1) komunikace 50881 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (2) navihace 18495 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (3) robot 25405 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (4) xml 28467 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (5) rur1 59609 Czech "R.U.R" marked up as XML. V set2 Murkup menshe chem 50% I harakter dannych tekstovyj=> pokazyvajet horoshije rezultaty. 28 Katsiaryna Chernik, Jan Lánský, Leo Galamboš 5.1 XML data sources XMLSyl and XMillSyl were tested on two data sets that cover a wide range of XML data formats and structures. The first data set is shown in Table 1. It contains English XML documents with different inner structure. It includes regular data that has regular markup and short character data content (elts, stats, weblog, tpc). It also includes irregular data, that has irregular markup (pcc, tall). The second data set is shown in Table 2. It contains textual XML documents of simple structure with long character data content. It contains five stage plays marked up as XML, four in English and one in Czech. It also contains data in DocBook format in Czech and in English. Some dataSize was distributed Lang with the XMLPPM [3] and the Exalt [4] com- Description pressors elts while others were found 103919 English on of Periodic table Internet [15], the elements [16]. All Czech documents use in XML Windows-1250 pcc encoding. 2600257 English Formal proofs transformed to XML stats 869059 English One year statistics if baseball players tal 1364576 English Safe-annotated assembly language converted to XML tpc 313193 EnglishTable The XML2. The second representation of thedata set. TPC_D benchmark database. Size Lan Description errors 153530 English "The Comedy of Errors" marked up as XML hamlet 314677 English "The Tragedy of Hamlet, Prince of Denmark" marked up as XML antony 289865 English "The Tragedy of Antony and Cleopatra" marked up as XML much_ado 220495 English "Much Ado about Nothing" marked up as XML ch00 13916 English "DocBook: The Definitive Guide" in DocBook format (1) ch01 55015 English "DocBook: The Definitive Guide" in DocBook format (2) ch02 160728 English "DocBook: The Definitive Guide" in DocBook format (3) ch03 27799 English "DocBook: The Definitive Guide" in DocBook format (4) ch04 137440 English "DocBook: The Definitive Guide" in DocBook format (6) ch05 67142 English "DocBook: The Definitive Guide" in DocBook format (7) glossary 24701 English "DocBook: The Definitive Guide" in DocBook format (8) howto 42853 English "DocBook V5.0, Transition Guide" in DocBook format. hledani 16429 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (1) komunikace 50881 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (2) navihace 18495 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (3) robot 25405 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (4) xml 28467 Czech "Inteligentní podpora navigace na WWW s využitím XML" in DocBook (5) rur1 59609 Czech "R.U.R" marked up as XML. V set2 Murkup menshe chem 50% I harakter dannych tekstovyj=> pokazyvajet horoshije rezultaty. 5.2 Compression Performance Metrics The compression ratio is defined as follows: sizeof (compressed f ile) × 8 CR = bits/byte. sizeof (original f ile) We compare XMillSyl and XMLSyl compression ratios with those of XMill. The compression ratio factor shows normalization of the compression ratio of Syllable-based Compression for XML Documents 29 Table 3. The first data set. CRLZWL CRXmill CRXMillSyl CRFXMillSyll CRXMLSyl CRFXMLSyl 1 1 elts 1,04 0,47 0,54 1,15 0,72 1,53 2 pcc 0,22 0,02 0,03 1,50 0,04 2,00 3 stats 0,67 0,33 0,40 1,21 0,39 1,18 4 tal 0,36 0,09 0,12 1,33 0,15 1,67 5 tpc 1,82 1,05 1,54 1,47 1,60 1,52 Average 0,82 0,39 0,53 1,33 0,58 1,58 CRLZWL CRXmill CRXMillSyl CRFXMillSyll CRXMLSyl CRFXMLSyl 1 errors 1,98 1,83 2,00 1,09 1,83 1,00 2 XMillSyll or XMLSyl 1,96 hamlet with respect 1,91 to 2,00 XMill. The1,05compression 1,85 ratio 0,97factor is 3 defined antony as follows: 1,84 1,79 1,88 1,05 1,69 0,94 4 much_ado 1,88 1,80 CRXSyl1,05 1,89 1,77 0,98 CRF = . 5 ch00 3,28 2,69 XSyl 3,00CRXM ill1,12 2,88 1,07 6 ch01 2,69 2,20 2,43 1,10 2,46 1,12 7 ch02 1,76 1,43 1,70 1,19 1,57 1,10 8 ch03 2,90 1,87 2,70 1,44 2,08 1,11 9 ch04 2,09 1,66 1,78 1,07 1,83 1,10 10 ch05 5.3 Experimental 2,28 Results 1,81 2,03 1,12 2,04 1,13 11 glossary 2,07 1,64 1,84 1,12 1,89 1,15 12 The compression howto ratio6,69 statistics 2,30 of two2,50 sets of XML 1,09 documents 2,59 are 1,13shown in 13 and Table 4. 3,79 Table 3 hledani 3,13 3,62 1,16 3,40 1,09 14 The komunikace syllable-based 3,25 method2,65performed2,93 worse on1,11 3,01from the documents 1,14first data 15 navihace 3,79 3,14 3,68 1,17 3,44 set. On the other hand, both XMLSyl and XMillSyl shows great improvement 1,10 16 robot 3,43 2,86 3,22 1,13 3,04 1,06 17 comparing xml to LZWL. 3,74They compressed 3,23 the input1,14 3,69 to 50-60% 3,30 of the1,02 size of the 18 compressed rur1 file with LZWL. 2,33 2,07 2,37 1,14 2,15 1,04 On XML Average documents 2,88of the 2,22second 2,51 data set, 1,13 LZWL provides 2,38 a1,07 reasonably good compression ratio - on the average, about two-thirds that of XMill. This confirmschour prediction, 1,84that 1,61 1,78 compression syllable-based 1,11 is1,70 effective1,06 for textual1,11 books 1,71 1,79 1,75 0,98 1,66 0,93 XML documents. Moreover our compression methods show even greater im- ch+books 1,80 1,74 1,76 1,01 1,72 0,99 provement. 3,13 2,63 2,81 1,07 2,93 1,11 0,935943 On the document of the second 2,83 2,32 data set, XMillSyl 2,51 1,08 achieves 2,60 about 1,1215% and 0,924303 XMLSyl is about 20% better 2,78 compression 2,28 ratio than 2,47 LZWL.2,57 1,08 Compared 1,13to 0,923077 XMill, both methods perform 2,58 slightly2,14 2,30 worse. XMillSyl 1,07 compresses 2,40 13% about 1,12 and0,930435 XML- 2,49 Syl about 7% worse than XMill. 2,15 2,32 1,08 2,34 1,09 0,926724 2,40 2,07 2,22 1,07 2,25 1,09 0,932432 Figure 4 shows the2,30 variation1,97 of the 2,17 compression 1,10 ratio as a function 2,15 of XML 1,09 0,907834 data size for ”DocBook:2,21 The1,90Definitive2,08Guide”.The 1,09 compression 2,06 was run 1,08 on 0,913462 several subsets. On small 2,17 files1,89 XMillSyl2,10 performs 1,11 better than 2,03XMLSyl.1,07 The ex- 0,9 planation is, that the 2,07 data are 1,80split into 2,01 many 1,12 1,93 small containers 1,07 in 0,895522 XMillSyl, which are compressed 1,98 with gzip1,73 (gzip 1,93 outperforms 1,12LZWL, 1,84 1,06on0,896373 especially small 1,92 1,68 1,88 1,12 1,79 1,07 0,893617 data). On middle-sized1,89and large 1,65 files XMLSyl 1,85 outperforms 1,12 1,76 XMillSyl. We can 1,07 0,891892 observe that the bigger1,88 size also 1,64implies1,83 a better compression. 1,12 1,74 1,06 0,896175 6 Conclusion In this work we introduced syllable-based compression tools for XML documents called XMLSyl and XMillSyl. We presented the architecture and implementation 30 CRLZWL CR Katsiaryna Chernik, Jan CRXMillSyl Lánský, Xmill CRFXMillSyll CRXMLSyl CRFXMLSyl Leo Galamboš 1 1 elts 1,04 0,47 0,54 1,15 0,72 1,53 2 pcc 0,22 0,02 0,03 1,50 0,04 2,00 3 stats 0,67 0,33 0,40 1,21 0,39 1,18 4 tal 0,36 0,09 0,12 1,33 0,15 1,67 5 tpc 1,82 1,05 1,54 1,47 1,60 1,52 Table 4. The first data set. Average 0,82 0,39 0,53 1,33 0,58 1,58 CRLZWL CRXmill CRXMillSyl CRFXMillSyll CRXMLSyl CRFXMLSyl 1 errors 1,98 1,83 2,00 1,09 1,83 1,00 2 hamlet 1,96 1,91 2,00 1,05 1,85 0,97 3 antony 1,84 1,79 1,88 1,05 1,69 0,94 4 much_ado 1,88 1,80 1,89 1,05 1,77 0,98 5 ch00 3,28 2,69 3,00 1,12 2,88 1,07 6 ch01 2,69 2,20 2,43 1,10 2,46 1,12 7 ch02 1,76 1,43 1,70 1,19 1,57 1,10 8 ch03 2,90 1,87 2,70 1,44 2,08 1,11 9 ch04 2,09 1,66 1,78 1,07 1,83 1,10 10 ch05 2,28 1,81 2,03 1,12 2,04 1,13 11 glossary 2,07 1,64 1,84 1,12 1,89 1,15 12 howto 6,69 2,30 2,50 1,09 2,59 1,13 13 hledani 3,79 3,13 3,62 1,16 3,40 1,09 14 komunikace 3,25 2,65 2,93 1,11 3,01 1,14 15 navihace 3,79 3,14 3,68 1,17 3,44 1,10 16 robot 3,43 2,86 3,22 1,13 3,04 1,06 17 xml 3,74 3,23 3,69 1,14 3,30 1,02 18 rur1 2,33 2,07 2,37 1,14 2,15 1,04 Average 2,88 2,22 2,51 1,13 2,38 1,07 ch 1,84 1,61 1,78 1,11 1,70 1,06 1,11 books 1,71 1,79 1,75 0,98 1,66 0,93 ch+books 1,80 1,74 1,76 1,01 1,72 0,99 3,13 2,63 2,81 1,07 2,93 1,11 0,935943 2,83 2,32 2,51 1,08 2,60 1,12 0,924303 3,25 3,20 2,78 2,28 2,47 1,08 2,57 1,13 0,923077 3,15 3,10 3,05 2,58 2,14 2,30 1,07 2,40 1,12 0,930435 2,49 2,15 2,32 1,08 2,34 1,09 0,926724 Compression Ratio (Bit/Byte) 3,00 2,95 2,90 2,85 2,40 2,07 2,22 1,07 2,25 1,09 0,932432 2,80 2,75 2,70 2,30 1,97 2,17 1,10 2,15 1,09 0,907834 2,65 2,60 2,21 1,90 2,08 1,09 2,06 1,08 0,913462 2,55 2,50 2,45 2,17 1,89 2,10 1,11 2,03 1,07 LZWL 0,9 2,40 2,35 2,07 1,80 2,01 1,12 1,93 1,07 XMillSyl 0,895522 2,30 2,25 1,98 1,73 1,93 1,12 1,84 1,06 XMLSyl 0,896373 2,20 2,15 2,10 1,92 1,68 1,88 1,12 1,79 1,07 XMill 0,893617 2,05 2,00 1,89 1,65 1,85 1,12 1,76 1,07 0,891892 1,95 1,90 1,85 1,88 1,64 1,83 1,12 1,74 1,06 0,896175 1,80 1,75 1,70 1,65 1,60 1,55 1,50 16149 27481 36483 58149 86991 112592 145116 168453 201423 217322 232569 254382 281915 303018 328481 355426 385967 399688 424093 446610 Size (byte) Fig. 4. Compression ratio under different sizes. Syllable-based Compression for XML Documents 31 of our tools and tested their performance on a variety of XML documents. In our experiments, XMLSyl and XMillSyl were compared with LZWL and XMill. Both methods are more suitable for textual XML documents. XMill outperformed our methods only marginally. XMLSyl performs better than XMillSyl. It implies that in our case encoding of XML structure is more efficient than separating a structure from data and grouping data values with related meaning. XMillSyl and XMLSyl show better results for Czech language. In the future, we want implement some modifications to enhance the com- pression ratio. For example, the information in the DTD section can be extracted and utilized to create a special syllable dictionary for elements and attributes. References 1. Wilfred Ng, Lam Wai, Yeung James Cheng. Comparative Analysis of XML Com- pression Technologies. World Wide Web Journal, 2005 2. Smitha S. Nair. XML Compression Techniques: A Survey. www.cs.uiowa.edu/~rlawrenc/research/Students/SN_04_XMLCompress.pdf 3. J. Cheney. Compressing XML with Multiplexed Hierarchical PPM Models In Proc. Data Compression Conference, 2001. 4. V. Toman. Compression of XML Data. MFF UK, 2003 5. World Wide Web Consorcium. Extensive Markup Language (XML) 1.0. http://www.w3.org/XML/ 6. P. Tolani, J. R. Haritsa. XGrind: A Query-friendly XML Compressor. In Proc. IEEE International Conference on Data Engineering, 2002. 7. SAX: A Simple API for XML. http://www.saxproject.org 8. H. Liefke, D. Suciu. XMill: an Efficient Compressor for XML Data. In Proc. ACM SIGMOD Conference, 2000. 9. Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung, XPRESS: A Queriable Com- pression for XML Data SIGMOD 2003, June 912, 2003, San Diego, CA, 2000. 10. Expat XML Parser. http://expat.sourceforge.net 11. T. A. Welch. A technique for high performance data compression. IEEE Computer, 1984. 12. J. Lansky, M. Zemlicka. Text Compression: Syllables. DATESO, 2005 13. J. Lansky, Slabiková komprese. MFF UK, 2005 14. V. Toman. Komprese XML dat. http://kocour.ms.mff.cuni.cz/~mlynkova/prg036/ 15. J. Kosek. Inteligentnı́ podpora navigace na WWW s využitı́m XML. http://www.kosek.cz/diplomka/, 2002 16. DocBook http://www.docbook.org/ 17. A Quick Introduction to XML. http://www.cellml.org/tutorial/xml_guide 18. M. Pilgrim. What Is RSS. http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html 19. XML Processing. http://diveintopython.org/xml_processing/ 20. SAX And DOM Overview. http://www.jezuk.co.uk/cgi-bin/view/arabica/SAXandDOMIntro 21. The gzip home page. http://www.gzip.org/