Syllable-based Compression for XML Documents

Katsiaryna Chernik, Jan Lánský, Leo Galamboš

Charles University, Faculty of Mathematics and Physics
Malostranské nám. 25, 118 00 Praha 1, Czech Republic

1 Introduction

Syllable-based compression achieves sufficiently good results on text documents of medium size. Since the majority of XML documents are of that size, we suppose that the syllable-based method can give good results on XML documents, especially on documents that have a simple structure (a small number of elements and attributes) and relatively long character data content. In this paper we propose two syllable-based compression methods for XML documents. The first method, XMLSyl, replaces XML tokens (element tags and attributes) with special codes in the input document and then compresses this document using a syllable-based method. The second method, XMillSyl, incorporates syllable-based compression into XMill, an existing method for XML compression. XMLSyl and XMillSyl are compared with a non-XML syllable-based method and with other existing methods for XML compression.

2 Syllable-based compression method

Syllable-based compression [12] is a method in which compression is performed at the syllable level. There are two syllable-based compressors: the first is a syllable-based version of LZW, and the second is a syllable-based version of Huffman coding.

The LZW algorithm [11] is a character-based dictionary compression method. Its syllable-based version is called LZWL. In the initialization step, the syllable dictionary is filled with the empty syllable and with syllables from a database of frequent syllables. The following steps are similar to those of the character-based version of LZW, except that LZWL works over an alphabet of syllables.
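The LZWL idea can be illustrated as ordinary LZW run over a sequence of syllables instead of characters. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: the input is assumed to be already split into syllables, the hypothetical `frequent_syllables` argument stands in for the database of frequent syllables and is assumed to contain every syllable of the input, and codes are returned as plain integers rather than as a bit stream.

```python
def _initial_phrases(frequent_syllables):
    # Code 0 is the empty syllable; the following codes are the frequent
    # syllables (duplicates removed, order preserved).
    return [()] + [(s,) for s in dict.fromkeys(frequent_syllables)]


def lzwl_encode(syllables, frequent_syllables):
    """LZW over an alphabet of syllables; returns a list of integer codes."""
    dictionary = {p: i for i, p in enumerate(_initial_phrases(frequent_syllables))}
    codes, phrase = [], ()
    for s in syllables:
        if (s,) not in dictionary:
            # Simplification of this sketch: unknown syllables are not handled.
            raise ValueError(f"syllable {s!r} is not in the initial dictionary")
        candidate = phrase + (s,)
        if candidate in dictionary:
            phrase = candidate                         # keep extending the phrase
        else:
            codes.append(dictionary[phrase])           # emit the longest known phrase
            dictionary[candidate] = len(dictionary)    # learn the new phrase
            phrase = (s,)
    if phrase:
        codes.append(dictionary[phrase])
    return codes


def lzwl_decode(codes, frequent_syllables):
    """Inverse of lzwl_encode, given the same initial dictionary."""
    if not codes:
        return []
    table = _initial_phrases(frequent_syllables)
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        elif code == len(table):
            entry = prev + (prev[0],)    # phrase defined by the step that uses it
        else:
            raise ValueError("invalid code")
        out.extend(entry)
        table.append(prev + (entry[0],))
        prev = entry
    return out
```

For example, `lzwl_encode(["syl", "la", "ble"] * 3, ["syl", "la", "ble"])` produces six codes for nine syllables, and `lzwl_decode` restores the original sequence from them.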

The second syllable-based compression method is called HuffSyllable. It is a statistical compression method based on adaptive Huffman coding. For our purposes, we use only the LZWL syllable-based compression method, because the adaptation of HuffSyllable for XML compression gave worse results than LZWL.

3 XMLSyl

Our goal was to modify the syllable-based compression method so that it compresses XML documents efficiently. We modified an existing syllable-based compressor so that it treats XML tokens (element tags and attributes) as single syllables instead of decomposing them into many syllables. There were two ways to make the syllable-based compressor treat XML tokens as syllables:

1. Modify the parser used in the syllable-based tool and combine it with an XML parser, so that it recognizes XML tokens and treats each of them as a single syllable.
2. Replace XML tokens with bytes in the input document and then compress the resulting document with an existing syllable-based tool.

We decided to implement the second way because it allows us to make some future improvements easily. For example, we may make the syllable-based compressor assign codes of minimal length to XML tokens by adding these single bytes to the syllable dictionary [12]. This improvement is impossible in the first variant. The encoding of XML tokens is inspired by existing XML compression methods such as XMLPPM [3], XGrind [6], XPress [9] and XMill [8].

3.1 Architecture and principles of XMLSyl

The architecture of XMLSyl is shown in Figure 1. It has four major modules: the SAX Parser, the Structure Encoder, the Containers and the Syllable Compressor. First, the XML document is sent to the SAX Parser. Next, the parser decomposes the document into SAX events (start-tags, end-tags, data items, comments, etc.) and forwards them to the Structure Encoder.
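The first stage of this pipeline, turning a document into a stream of SAX events, can be sketched with Python's `xml.parsers.expat` binding of the Expat library. This is only an illustration: the function name `sax_events` and the tuple layout of the events are assumptions of this sketch, and whitespace-only character data between tags is dropped.

```python
import xml.parsers.expat


def sax_events(xml_text):
    """Return the SAX-style events of xml_text as a list of tuples."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True           # deliver character data in one piece
    parser.ordered_attributes = True    # attributes arrive as [name, value, ...]

    def start(name, attrs):
        pairs = list(zip(attrs[::2], attrs[1::2]))
        events.append(("startElement", name, pairs))

    def end(name):
        events.append(("endElement", name))

    def characters(data):
        if data.strip():                # skip whitespace between tags
            events.append(("characters", data))

    def comment(data):
        events.append(("comment", data))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = characters
    parser.CommentHandler = comment
    parser.Parse(xml_text, True)        # True marks the final chunk of input
    return events
```

A consumer such as a structure encoder can then iterate over the returned list; a streaming implementation would instead act inside the handlers so that the whole event list never has to be held in memory.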

The Structure Encoder encodes the SAX events and routes them to the different Containers. There are three containers in our implementation:

1. Element Container: The Element Container stores the names of all elements that occur in an XML document. The Structure Encoder also uses the Element Container as the dictionary for encoding XML structure.
2. Attribute Container: The Attribute Container stores the names of all attributes that occur in an XML document. The Structure Encoder also uses the Attribute Container as the dictionary for encoding XML structure.
3. Data and Structure Container: The Data and Structure Container stores an XML document in which all meta-data are replaced with special codes. The encoding process is presented in section 3.2.

[Figure 1: XML Document, SAX Parser, Structure Encoder, Element Container, Attribute Container, Data and Structure Container, Syllable Compressor, Compressed XML document]

When the document has been completely parsed and separated into the containers, the contents of the containers are sent to the Syllable Compressor. It compresses the content of each container separately using syllable-based compression and sends the result to the output.

We did not write the SAX parser ourselves; instead, we used the Expat parser [10], an open-source SAX parser written in C.

3.2 Encoding the structure of an XML document

The structure of an XML document is encoded in XMLSyl as follows. Whenever a new element or attribute is encountered, its name is sent to the dictionary and its index is sent to the Data and Structure Container. Two different dictionaries are used for attributes and elements: the Element Dictionary and the Attribute Dictionary. The Attribute Container operates as the Attribute Dictionary and the Element Container as the Element Dictionary. Whenever an end tag is encountered, the token END_TAG is sent to the Data and Structure Container. Whenever a character sequence is encountered, it is sent to the Data and Structure Container without changes. The start and the end of a character sequence are indicated by special tokens. We distinguish four kinds of character sequences: attribute values, element values, comments, and white spaces between tags (if white spaces are preserved).

To illustrate the encoding process, consider the encoding of the following small XML document:

    <book>
      <title lang="en">XML</title>
      <author>Brown</author>
      <author>Smith</author>
      <price currency="EURO">49</price>
    </book>
    <!--Comment-->

First, the XML document is converted into the corresponding stream of SAX events:

    startElement("book")
    startElement("title",("lang","en"))
    characters("XML")
    endElement("title")
    startElement("author")
    characters("Brown")
    endElement("author")
    startElement("author")
    characters("Smith")
    endElement("author")
    startElement("price",("currency","EURO"))
    characters("49")
    endElement("price")
    endElement("book")
    comment("Comment")

The tokens in the SAX event stream are sent to the Structure Encoder, which encodes them and sends them to their corresponding containers. When the book start element token is encountered, the string book is sent to the Element Container, since this element name has not been encountered before. The index E0 is assigned to this entry and is sent to the Data and Structure Container. The same operation is executed for the title start element: the string title is sent to the Element Container, the index E1 is assigned to it, and E1 is sent to the Data and Structure Container. The element title has the attribute lang. The attribute name is sent to the Attribute Container and the index A0 is assigned to it. The index A0 is sent to the Data and Structure Container. Then the attribute value "en" is sent without modification to the Data and Structure Container. The attribute value is followed by the token END_ATT, which signals the end of the attribute value. When an element value such as "XML" is encountered, the token CHAR, which signals the beginning of a character sequence, the data value, and then the token END_CHAR are all sent to the Data and Structure Container. Finally, all end tags are replaced by the token END_TAG. When a comment event is encountered, the code CMNT is put into the Data and Structure Container. The comment text is then sent to the container, followed by the END_CMNT code. The final state of all containers is shown in Figure 2.
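The encoding steps just described can be sketched as a small function over a list of SAX events. It is a minimal illustration, not the XMLSyl source: events are assumed to be tuples such as `("startElement", name, [(attribute, value), ...])`, and the codes are kept as readable strings (E0, A0, END_TAG, ...) rather than bytes.

```python
def encode_structure(events):
    """Return (element names, attribute names, data-and-structure stream)."""
    elements, attributes = [], []    # Element Container, Attribute Container
    stream = []                      # Data and Structure Container

    def index(container, name, prefix):
        # A name seen for the first time is appended to its dictionary.
        if name not in container:
            container.append(name)
        return f"{prefix}{container.index(name)}"

    for event in events:
        kind = event[0]
        if kind == "startElement":
            _, name, attrs = event
            stream.append(index(elements, name, "E"))
            for attr_name, value in attrs:
                stream += [index(attributes, attr_name, "A"), value, "END_ATT"]
        elif kind == "endElement":
            stream.append("END_TAG")
        elif kind == "characters":
            stream += ["CHAR", event[1], "END_CHAR"]
        elif kind == "comment":
            stream += ["CMNT", event[1], "END_CMNT"]
    return elements, attributes, stream


BOOK_EVENTS = [
    ("startElement", "book", []),
    ("startElement", "title", [("lang", "en")]),
    ("characters", "XML"),
    ("endElement", "title"),
    ("startElement", "author", []),
    ("characters", "Brown"),
    ("endElement", "author"),
    ("startElement", "author", []),
    ("characters", "Smith"),
    ("endElement", "author"),
    ("startElement", "price", [("currency", "EURO")]),
    ("characters", "49"),
    ("endElement", "price"),
    ("endElement", "book"),
    ("comment", "Comment"),
]
```

Calling `encode_structure(BOOK_EVENTS)` yields the element names book, title, author, price, the attribute names lang, currency, and a stream that begins `E0 E1 A0 en END_ATT CHAR XML END_CHAR END_TAG`.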

Element Container

    element    index
    book       E0
    title      E1
    author     E2
    price      E3

Attribute Container

    attribute  index
    lang       A0
    currency   A1

Data and Structure Container

    <book>                               E0
    <title lang="en">XML</title>         E1 A0 en END_ATT CHAR XML END_CHAR END_TAG
    <author>Brown</author>               E2 CHAR Brown END_CHAR END_TAG
    <author>Smith</author>               E2 CHAR Smith END_CHAR END_TAG
    <price currency="EURO">49</price>    E3 A1 EURO END_ATT CHAR 49 END_CHAR END_TAG
    </book>                              END_TAG
    <!--Comment-->                       CMNT Comment END_CMNT

Fig. 2. Final state of the containers.

In this example we have ignored white spaces between tags, e.g. between <book> and <title>, so the decompressor produces a standard indentation. Optionally, XMLSyl can preserve the white spaces. In that case, it stores them as a sequence of characters in the Data and Structure Container between the tokens WS and END_WS.

3.3 Containers

The containers are the basic units for grouping XML data. The Attribute Container holds attribute names and the Element Container holds element names. As long as the number of element and attribute names in an XML document is not high, these two containers are kept in main memory. During parsing, the size of a container increases as it is filled with entries. Each entry in the Element Container is assigned a byte in the range 00-A9. These bytes are used for encoding the element names. Each entry in the Attribute Container is assigned a byte in the range AA-F9. These bytes are used for encoding the attribute names. The remaining 6 bytes are reserved for special codes such as CHAR and END_TAG. In most cases, 170 (or 80) bytes are enough to encode the element (or attribute) names. If the number of elements (or attributes) is greater than 170 (or 80), entries are encoded with two bytes, then three, and so on.
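The byte ranges above can be checked with a short sketch. It covers only the single-byte case: the text does not spell out how the two-byte and longer codes are formed, nor which of the six reserved bytes stands for which special token, so the sketch raises an error instead of guessing. The function names are illustrative.

```python
ELEMENT_BYTES = range(0x00, 0xAA)     # 00-A9: 170 codes for element names
ATTRIBUTE_BYTES = range(0xAA, 0xFA)   # AA-F9: 80 codes for attribute names
RESERVED_BYTES = range(0xFA, 0x100)   # FA-FF: 6 bytes left for special codes


def element_code(index):
    """Single-byte code of the index-th entry of the Element Container."""
    if not 0 <= index < len(ELEMENT_BYTES):
        raise ValueError("multi-byte element codes are outside this sketch")
    return bytes([ELEMENT_BYTES[index]])


def attribute_code(index):
    """Single-byte code of the index-th entry of the Attribute Container."""
    if not 0 <= index < len(ATTRIBUTE_BYTES):
        raise ValueError("multi-byte attribute codes are outside this sketch")
    return bytes([ATTRIBUTE_BYTES[index]])
```

The three ranges are disjoint and together cover all 256 byte values, which is why 170 element names and 80 attribute names fit into single bytes.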

The situation is different for the Data and Structure Container. We do not know the size of the input XML document in advance. The document can be so big that it does not fit into memory, and the container cannot grow without bound. Therefore, the container consists of two memory blocks of constant size. The content of the first memory block is compressed as soon as the container is filled. We do not compress the two blocks at once, because the context of the second memory block is used for the compression of the first one. After the compression, the compressed content of the first block is sent to the output and the first block swaps its role with the second one. The first block is then filled with new data. When it is full, the second block is compressed, and so on.

3.4 The Syllable Compressor

The Syllable Compressor compresses the Data and Structure Container first and sends the result to the output file. Then the Attribute Container is compressed and sent to the output file, and finally the same happens with the Element Container. LZWL is used for the compression of the data. HuffSyllable could also be chosen, but its performance is worse, so we decided to use only LZWL.

4 XMillSyl

This chapter introduces our second syllable-based XML compressor, XMillSyl. This method incorporates syllable-based compression into the existing XML compression method XMill [8]. XMill is based on two main principles for optimizing XML compression:
– separating structure from data content, and
– grouping data values with related semantics in the same "container".
Each data container is then compressed individually with gzip [21]. In XMillSyl, containers are compressed with LZWL.

We do not expect XMillSyl to give better results than XMill, because gzip compression performs better than LZWL. We implemented XMillSyl in order to compare the power of XMLSyl with the power of the two main principles of XMill.

4.1 Implementation

We did not write the XMill compressor ourselves; we used the existing XMill sources.

XMill operates as follows: a SAX parser parses the XML file and the SAX events are sent to the core module of XMill, called the path processor. It determines how to map tokens to containers: element tag names and attribute names are encoded and sent to the structure container, while the data values are sent to various data containers according to their semantics. Finally, the containers are gzipped independently and stored on disk.
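The separation of structure from data can be sketched as follows. This is a simplified illustration and not XMill's path processor: data values are grouped by the full path of their enclosing element or attribute (an assumption of this sketch), names are kept as strings instead of being encoded, and Python's `gzip` module stands in for the gzip step.

```python
import gzip
import xml.parsers.expat
from collections import defaultdict


def separate(xml_text):
    """Split a document into a structure list and per-path data containers."""
    structure = []                     # tag and attribute names, "/" for end tags
    containers = defaultdict(list)     # path -> list of data values
    path = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True

    def start(name, attrs):
        path.append(name)
        structure.append(name)
        for attr, value in attrs.items():
            structure.append("@" + attr)
            containers["/".join(path) + "/@" + attr].append(value)

    def end(name):
        structure.append("/")
        path.pop()

    def characters(data):
        if data.strip():
            containers["/".join(path)].append(data)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = characters
    parser.Parse(xml_text, True)
    return structure, dict(containers)


def compress_containers(containers):
    """Compress every data container independently."""
    return {key: gzip.compress("\n".join(values).encode("utf-8"))
            for key, values in containers.items()}
```

Grouping values this way places similar data, such as all author names, next to each other, which is the property the second XMill principle relies on.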

We have modified the compression and decompression functions (operating on containers) so that they compress and decompress the data containers with the syllable-based method (see Figure 3). Moreover, we have modified the syllable-based method so that it can work with the containers of the XMill implementation instead of a file stream.

[Figure 3: Input XML file, SAX Parser, Path Processor, containers (including Large Data Container k) compressed with GZip or LZWL, Compressed XML file]

XMillSyl distinguishes between small and large containers. Since LZWL is not suitable for extremely small data, the small containers are compressed with gzip. The structure container is also gzipped in XMillSyl. The large containers are compressed with LZWL.
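The routing rule can be written down in a few lines. The sketch below is illustrative only: the size limit is a hypothetical value, because the text does not state where the boundary between small and large containers lies, and `lzwl_compress` is a caller-supplied function standing in for the LZWL compressor.

```python
import gzip

SMALL_CONTAINER_LIMIT = 4096   # hypothetical threshold in bytes, not from the paper


def compress_container(data, lzwl_compress, is_structure=False,
                       limit=SMALL_CONTAINER_LIMIT):
    """Return (method name, compressed bytes) for one container."""
    if is_structure or len(data) < limit:
        # The structure container and small data containers go to gzip.
        return "gzip", gzip.compress(data)
    # Large data containers go to the syllable-based compressor.
    return "lzwl", lzwl_compress(data)
```

Returning the method name together with the payload reflects that a decompressor has to know which method was applied to each container.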


In the second data set, markup makes up less than 50% of the content and the data is textual in character, which gives good results.

5.1 XML data sources

XMLSyl and XMillSyl were tested on two data sets that cover a wide range of XML data formats and structures. The first data set is shown in Table 1. It contains English XML documents with different inner structure. It includes regular data, which has regular markup and short character data content (elts, stats, weblog, tpc). It also includes irregular data, which has irregular markup (pcc, tal).

The second data set is shown in Table 2. It contains textual XML documents of simple structure with long character data content. It contains five stage plays marked up as XML, four in English and one in Czech. It also contains data in DocBook format in Czech and in English.

Some data was distributed with the XMLPPM [3] and the Exalt [4] compressors, while others were found on the Internet [15], [16]. All Czech documents use the Windows-1250 encoding.

Table 1. The first data set.

    File    Size      Lang     Description
    elts    103919    English  Periodic table of the elements in XML
    pcc     2600257   English  Formal proofs transformed to XML
    stats   869059    English  One year statistics of baseball players
    tal     1364576   English  Safe-annotated assembly language converted to XML
    tpc     313193    English  The XML representation of the TPC_D benchmark database

Table 2. The second data set.

The compression ratio factor (CRF) expresses the compression ratio of XMillSyl or XMLSyl with respect to XMill. It is defined as follows:

$$CRF_{XSyl} = \frac{CR_{XSyl}}{CR_{XMill}}$$

where XSyl stands for XMillSyl or XMLSyl.

5.3 Experimental Results

The compression ratio statistics of the two sets of XML documents are shown in Table 3 and Table 4.

Table 3. The second data set.

    File        CR_LZWL  CR_XMill  CR_XMillSyl  CRF_XMillSyl  CR_XMLSyl  CRF_XMLSyl
    errors      1,98     1,83      2,00         1,09          1,83       1,00
    hamlet      1,96     1,91      2,00         1,05          1,85       0,97
    antony      1,84     1,79      1,88         1,05          1,69       0,94
    much_ado    1,88     1,80      1,89         1,05          1,77       0,98
    ch00        3,28     2,69      3,00         1,12          2,88       1,07
    ch01        2,69     2,20      2,43         1,10          2,46       1,12
    ch02        1,76     1,43      1,70         1,19          1,57       1,10
    ch03        2,90     1,87      2,70         1,44          2,08       1,11
    ch04        2,09     1,66      1,78         1,07          1,83       1,10
    ch05        2,28     1,81      2,03         1,12          2,04       1,13
    glossary    2,07     1,64      1,84         1,12          1,89       1,15
    howto       6,69     2,30      2,50         1,09          2,59       1,13
    hledani     3,79     3,13      3,62         1,16          3,40       1,09
    komunikace  3,25     2,65      2,93         1,11          3,01       1,14
    navihace    3,79     3,14      3,68         1,17          3,44       1,10
    robot       3,43     2,86      3,22         1,13          3,04       1,06
    xml         3,74     3,23      3,69         1,14          3,30       1,02
    rur1        2,33     2,07      2,37         1,14          2,15       1,04
    Average     2,88     2,22      2,51         1,13          2,38       1,07

The syllable-based method performed worse on documents from the first data set. On the other hand, both XMLSyl and XMillSyl show a great improvement compared to LZWL. They compress the input to 50-60% of the size of the file compressed with LZWL.

On XML documents of the second data set, LZWL provides a reasonably good compression ratio: on average, about two-thirds that of XMill. This confirms our prediction that syllable-based compression is effective for textual XML documents. Moreover, our compression methods show even greater improvement.
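As a small worked check of the compression ratio factor, the following sketch recomputes two tabulated values from the compression ratios reported for the files errors and elts. The function name `crf` is illustrative; the decimal commas of the tables are written as decimal points.

```python
def crf(cr_method, cr_xmill):
    """Compression ratio factor of a method with respect to XMill."""
    return cr_method / cr_xmill


# errors: CR of XMillSyl is 2,00 and CR of XMill is 1,83
errors_crf_xmillsyl = round(crf(2.00, 1.83), 2)

# elts: CR of XMillSyl is 0,54 and CR of XMill is 0,47
elts_crf_xmillsyl = round(crf(0.54, 0.47), 2)
```

Both results agree with the tabulated factors 1,09 and 1,15. A factor above 1 means that the method produced a larger compression ratio value than XMill on that file.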

On the documents of the second data set, XMillSyl achieves about 15% and XMLSyl about 20% better compression ratio than LZWL. Compared to XMill, both methods perform slightly worse: XMillSyl compresses about 13% and XMLSyl about 7% worse than XMill. Figure 4 shows the variation of the compression ratio as a function of XML data size for "DocBook: The Definitive Guide". The compression was run on several subsets. On small files XMillSyl performs better than XMLSyl. The explanation is that in XMillSyl the data are split into many small containers, which are compressed with gzip (gzip outperforms LZWL, especially on small data). On middle-sized and large files XMLSyl outperforms XMillSyl. We can observe that a bigger size also implies better compression.

Table 4. The first data set.

    File     CR_LZWL  CR_XMill  CR_XMillSyl  CRF_XMillSyl  CR_XMLSyl
    elts     1,04     0,47      0,54         1,15          0,72
    pcc      0,22     0,02      0,03         1,50          0,04
    stats    0,67     0,33      0,40         1,21          0,39
    tal      0,36     0,09      0,12         1,33          0,15
    tpc      1,82     1,05      1,54         1,47          1,60
    Average  0,82     0,39      0,53         1,33          0,58

Fig. 4. Compression ratio under different sizes.

6 Conclusion

In this work we introduced syllable-based compression tools for XML documents called XMLSyl and XMillSyl. We presented the architecture and implementation of our tools and tested their performance on a variety of XML documents. In our experiments, XMLSyl and XMillSyl were compared with LZWL and XMill. Both methods are more suitable for textual XML documents. XMill outperformed our methods only marginally. XMLSyl performs better than XMillSyl, which implies that in our case encoding the XML structure is more efficient than separating structure from data and grouping data values with related meaning. XMillSyl and XMLSyl show better results for the Czech language.

In the future, we want to implement some modifications to enhance the compression ratio. For example, the information in the DTD section can be extracted and used to create a special syllable dictionary for elements and attributes.

References

1. Wilfred Ng, Lam Wai Yeung, James Cheng. Comparative Analysis of XML Compression Technologies. World Wide Web Journal, 2005.
2. Smitha Nair. XML Compression Techniques: A Survey. www.cs.uiowa.edu/~rlawrenc/research/Students/SN_04_XMLCompress.pdf
3. Cheney. Compressing XML with Multiplexed Hierarchical PPM Models. In Proc. Data Compression Conference, 2001.
4. Toman. Compression of XML Data. MFF UK, 2003.
5. World Wide Web Consortium. Extensible Markup Language (XML) 1.0. http://www.w3.org/XML/
6. Tolani, J. R. Haritsa. XGrind: A Query-friendly XML Compressor. In Proc. IEEE International Conference on Data Engineering, 2002.
7. SAX: A Simple API for XML. http://www.saxproject.org
8. Liefke, Suciu. XMill: an Efficient Compressor for XML Data. In Proc. ACM SIGMOD Conference, 2000.
9. Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung. XPRESS: A Queriable Compression for XML Data. SIGMOD 2003, June 9-12, 2003, San Diego, CA.
10. Expat XML Parser. http://expat.sourceforge.net
11. T. A. Welch. A technique for high performance data compression. IEEE Computer, 1984.
12. J. Lansky, M. Zemlicka. Text Compression: Syllables. DATESO, 2005.
13. J. Lansky. Slabiková komprese. MFF UK, 2005.
14. Toman. Komprese XML dat. http://kocour.ms.mff.cuni.cz/~mlynkova/prg036/
15. Kosek. Inteligentní podpora navigace na WWW s využitím XML. http://www.kosek.cz/diplomka/, 2002.
16. DocBook. http://www.docbook.org/
17. A Quick Introduction to XML. http://www.cellml.org/tutorial/xml_guide
18. Pilgrim. What Is RSS. http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html
19. XML Processing. http://diveintopython.org/xml_processing/
20. SAX And DOM Overview. http://www.jezuk.co.uk/cgi-bin/view/arabica/SAXandDOMIntro
21. The gzip home page. http://www.gzip.org/