<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Syllable-Based Burrows-Wheeler Transform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Lánský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katsiaryna Chernik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zuzana Vlčková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics</institution>
          ,
          <addr-line>Malostranské nám. 25, 118 00 Praha 1, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Burrows-Wheeler Transform (BWT) is a compression method which reorders an input string into a form that is more amenable to further compression; usually a Move-To-Front transform followed by Huffman coding is applied to the permuted string. The original method [3] from 1994 was designed for a character alphabet. In 2001, versions working with word and n-gram alphabets were presented. The newest version copes with a syllable alphabet [7]. The goal of this article is to compare BWT compression working with alphabets of letters, syllables, words, 3-grams and 5-grams.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Burrows-Wheeler Transform is an algorithm that takes a block of text as
input and rearranges it using a sorting algorithm. The output can be compressed
with other algorithms such as bzip2. To compare different ways of document
parsing and their influence on BWT, we decided to deal with natural units of the
text: letters, syllables and words; and to compare these approaches with each other
and also with methods that divide the text into unnatural units – N-grams.
In our measurements we found that, depending on the language, words
have 5–6 letters and syllables 2–3 letters on average. Therefore 3-grams were
chosen to conform to the average syllable length and 5-grams to correspond to the
average word length.</p>
      <p>If we want to compare BWT for various ways of text file parsing, it
is necessary to use an implementation modified only in document parsing; the
rest of the compression method stays unchanged.</p>
      <p>
        An important BWT parameter is the block size the document is compressed
into; the larger the block, the better the results. For example, in the bzip2 algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
the maximum block size is 900 kB. We decided to choose the block size so that
any document up to 5 MB could be treated as a single block. This approach
is fair since, e.g., word-based methods will not be favoured by treating the
document as one block while letter-based methods would split the text into several
blocks.
      </p>
      <p>
        For our purposes, we had to alter two of the method’s properties. First, we modified the
XBW method [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to be able to compress all the required entities. The XBW
transform represents a labeled tree using two arrays, and supports navigation and
search operations by means of standard query operations on arrays. Since this
method was designed for syllable and word compression, it was easy to adjust it
to also compress the remaining entities (letters, 3-grams and 5-grams). Second, since
this method was intended as an XML compression algorithm, we had to adapt it to
plain text documents. (This research was partially supported by the Program
“Information Society” under project 1ET100300419.)
      </p>
      <p>The XBW method does not use the syllable or word model attributes which predict
the alternation of word or syllable types.</p>
      <p>In Section 2 we try different strategies of document parsing. Section 3
details the XBW method, Section 4 describes the experiments and their results, and
Section 5 contains the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>Parsing strategies</title>
      <p>We parse an input document into words, syllables, letters, 3-grams and 5-grams.
Examples of dividing the string “consists of” into each of these entity types are
given in Table 1.</p>
      <p>The word-based methods require parsing the input document into a stream
of words and non-words. Words are usually defined as the longest alphanumeric
strings in the text while non-words are the remaining fragments.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] the syllable is defined as: “a unit of pronunciation having one vowel
sound, with or without surrounding consonants, and forming all or part of
a word.” We do not need to follow this definition strictly. Therefore we will
use a simplified definition [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: “Syllable is a sequence of speech sounds, which
contains exactly one vowel subsequence.” Consequently, the number of syllables
in a word is equal to the number of the word’s maximal vowel sequences.
Example 1. The word “wolf” contains one maximal vowel
subsequence (“o”), so it is one syllable, while the word “ouzel” contains two maximal
vowel sequences, “ou” and “e”, so it has two syllables, “ou” and “zel”.
      </p>
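      <p>The decomposition by maximal vowel subsequences can be sketched as follows. This is a minimal illustration of the simplified definition above, not the authors’ implementation; treating “y” as a vowel and attaching consonant clusters to the following syllable (as in “ou”, “zel”) are our assumptions.</p>
      <preformat>
```python
import re

VOWELS = "aeiouy"  # counting 'y' as a vowel is our assumption

def split_syllables(word: str) -> list[str]:
    """Split `word` so that every part contains exactly one maximal vowel
    subsequence; consonants between two vowel groups are attached to the
    following part (which reproduces "ouzel" -> "ou", "zel")."""
    # end positions of all maximal vowel runs
    ends = [m.end() for m in re.finditer(f"[{VOWELS}]+", word.lower())]
    if len(ends) <= 1:
        return [word]                      # at most one vowel group: one syllable
    cuts = [0] + ends[:-1] + [len(word)]   # cut after every vowel group but the last
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]

print(split_syllables("wolf"))   # ['wolf']
print(split_syllables("ouzel"))  # ['ou', 'zel']
```
      </preformat>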
      <p>N-grams are sequences of n letters: a 3-gram contains 3 letters, a 5-gram
consists of 5 letters, and a 1-gram is a single letter.</p>
      <p>
        We designed the XBW method based on the Burrows-Wheeler transform [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This XBW method was partially described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; in this paper, we give a more
detailed description.
      </p>
      <p>
        The XBW method was not named very appropriately, because it can be
easily mistaken for the name xbw used by the authors of paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for a transformation of XML
into a format more suitable for searching. In a later article these
authors renamed the transformation from xbw to XBW. Moreover, they used it
in the compression methods called XBzip and XBzipIndex.
      </p>
      <p>
        The XBW method we designed consists of these steps: 1. replacement of the tag
names; 2. division into elements (words, syllables or n-grams); 3. encoding
of the dictionary of used elements [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]; 4. Burrows-Wheeler transform (BWT);
5. Move-to-Front transform (MTF); 6. Run Length Encoding of null sequences
(RLE); and 7. canonical Huffman coding. All the steps are also illustrated by
examples. Our tests were performed on plain texts, therefore we do not detail
the XML support of this method.
      </p>
      <sec id="sec-2-1">
        <title>Replacement of the tag names</title>
        <p>A SAX parser produces a sequence of SAX events processed by a structure coder.
This coder builds up two separate dictionaries, one for element names and one for
attribute names.</p>
        <p>If the corresponding dictionary contains the processed element or attribute
name, it is substituted by its assigned identifier. Otherwise it is substituted by
the lowest available identifier and put into the proper dictionary; moreover, the
name of the element or attribute is written to the output just after the new
identifier. This ensures that the dictionaries do not need to be coded explicitly and
can be reconstructed during extraction from the already processed part.
Example 2. Suppose the input
&lt;note importance="high"&gt;money&lt;/note&gt;
&lt;note importance="low"&gt;your money&lt;/note&gt;
Then the dictionaries look like</p>
        <p>Element dictionary: EA – End-of-Tag, A1 – note.
Attribute dictionary: E1 – importance, E2 – empty.</p>
        <p>The output is A1 E1 high EA money EA A1 E1 low EA your money EA.</p>
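        <p>The substitution rule above can be sketched as follows. This is a simplified sketch over a made-up event stream, not the actual SAX-based coder; note that, following the stated rule, each freshly assigned identifier is followed by the name itself, which the printed example output elides.</p>
        <preformat>
```python
def structure_code(events):
    """Sketch of the structure coder: element and attribute names get the
    lowest free identifier (A1, A2, ... for elements, E1, E2, ... for
    attributes); a new name is written right after its fresh identifier;
    'EA' marks end-of-tag."""
    elems, attrs, out = {}, {}, []
    for kind, value in events:
        if kind == "elem":
            table, prefix = elems, "A"
        elif kind == "attr":
            table, prefix = attrs, "E"
        elif kind == "end":
            out.append("EA")
            continue
        else:                               # text content passes through
            out.append(value)
            continue
        if value not in table:              # new name: assign the lowest free id
            table[value] = f"{prefix}{len(table) + 1}"
            out += [table[value], value]    # id, then the name itself
        else:
            out.append(table[value])
    return out

events = [("elem", "note"), ("attr", "importance"), ("text", "high"),
          ("end", None), ("text", "money"), ("end", None),
          ("elem", "note"), ("attr", "importance"), ("text", "low"),
          ("end", None), ("text", "your money"), ("end", None)]
print(" ".join(structure_code(events)))
# A1 note E1 importance high EA money EA A1 E1 low EA your money EA
```
        </preformat>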
      </sec>
      <sec id="sec-2-2">
        <title>Division into words or syllables</title>
        <p>
          Output of the previous step is divided into words or syllables as described in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
The coder then creates a dictionary of basic units (syllables or words). If the dictionary contains
a processed basic unit, it is substituted by its identifier. Otherwise, it is added
to the dictionary and assigned a unique identifier, and all occurrences in the
code are replaced by this identifier. The resulting stream is denoted the S-stream.
This step has two outputs: the S-stream and the dictionary of used basic units.
The dictionary also has to be encoded, to allow reconstruction of the document.
Example 3. Let us continue the previous example, where the syllables in the
dictionary could have the following associations: 01 – high, 02 – mo, 03 – ney,
04 – low, 05 – your. The S-stream is then A1 E1 01 EA 02 03 EA A1 E1 04 EA
05 06 02 03 EA.
        </p>
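        <p>A minimal sketch of this substitution follows; the two-digit identifiers mirror Example 3, and deciding which tokens are structure codes by their first letter is our simplification.</p>
        <preformat>
```python
def make_s_stream(tokens):
    """Replace each basic unit (word/syllable) by a two-digit identifier
    assigned on first occurrence; structure codes such as A1, E1, EA
    pass through unchanged.  Returns the S-stream and the dictionary."""
    dictionary, out = {}, []
    for t in tokens:
        if t[0] in "AE":                   # structure code: leave as-is
            out.append(t)
        else:
            if t not in dictionary:        # new basic unit: assign the next id
                dictionary[t] = f"{len(dictionary) + 1:02d}"
            out.append(dictionary[t])
    return out, dictionary

stream, d = make_s_stream(["A1", "E1", "high", "EA", "mo", "ney", "EA"])
print(stream)  # ['A1', 'E1', '01', 'EA', '02', '03', 'EA']
print(d)       # {'high': '01', 'mo': '02', 'ney': '03'}
```
        </preformat>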
      </sec>
      <sec id="sec-2-3">
        <title>The dictionary encoding</title>
        <p>
          The previous step also generates a dictionary of words or syllables which are
used in the text. TD3 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is one of the most effective dictionary compression
methods. It encodes the whole dictionary (represented as a trie) instead of the
separate items stored inside.
        </p>
        <p>Trie compression of a dictionary (TD) is based on the trie representing the
dictionary coding. Every node of the trie has the following attributes: represents
— a boolean value stating whether the node represents a string; count — the
number of sons; son — the array of sons; extension — the first symbol of an
extension for every son.</p>
        <p>The latest implementation of the TD3 algorithm employs a recursive procedure
EncodeNode3 (Figure 1) traversing the trie by depth-first search (DFS).
To encode the whole dictionary, we run this procedure on the root of the trie
representing the dictionary.</p>
        <p>An example is given in Figure 2. The example dictionary contains strings
".\n", "ACM", "AC", "to", and "the".</p>
        <p>
          In procedure EncodeNode3 we code only the number of sons and the distances
between the sons’ extensions. For non-leaf nodes we must use one bit to encode
whether that node represents a dictionary item (e.g. syllable or word) or not.
Leaves always represent dictionary items, so it is not necessary to code this bit for them.
Differences between extensions of sons are defined as distances between ord
function values of the extending characters. For coding the number of sons and
the distances between them we use gamma and delta Elias codes [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. (We have
tested other Elias codes too, but the best results were achieved for the gamma
and delta codes). The number of sons and distances between them can reach
the value 0 but standard versions of gamma and delta codes start from 1 —
it means that these codings do not support value 0. Therefore we use slight
modifications of Elias gamma and delta codes: gamma0(x) = gamma(x + 1)
and delta0(x) = delta(x + 1).
        </p>
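        <p>The gamma0 and delta0 codes can be sketched directly from the definitions of the Elias codes [4]; this is a minimal sketch in which bit strings are returned as text for readability.</p>
        <preformat>
```python
def elias_gamma(x: int) -> str:
    """Elias gamma: (N zeros) + binary(x), where N = len(binary(x)) - 1."""
    assert x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(x: int) -> str:
    """Elias delta: gamma(len(binary(x))) + binary(x) without its leading 1."""
    assert x >= 1
    b = bin(x)[2:]
    return elias_gamma(len(b)) + b[1:]

# The shifted variants from the text, supporting the value 0:
def gamma0(x: int) -> str:
    return elias_gamma(x + 1)

def delta0(x: int) -> str:
    return elias_delta(x + 1)

print(gamma0(2))  # 011   (the two sons of node 't' in the later example)
print(delta0(3))  # 01100 (the distance 3 - 0 of the first son of 't')
```
        </preformat>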
        <p>Using the ord function we reorder the symbols alphabetically according to
the symbols’ types and their occurrence frequencies typical for a certain
language. Since the distances between the sons are then smaller than the distances
in, for example, ASCII coding, they can be represented by shorter Elias delta codes.</p>
        <p>In our example (Figure 2) the symbols 0–27 are reserved for lower-case letters,
28–53 for upper-case letters, 54–63 for digits and 64–255 for other symbols.</p>
        <p>An additional improvement is based on the knowledge of the syllable type, which is
determined by the first one or two letters of the syllable. If a string begins with
a lower-case letter (lower-word or lower-syllable), the following letters must be
lower-case too. Hence in the trie, every son of a node representing a lower-case letter must
represent a lower-case letter as well.</p>
        <p>The same situation arises for other-words and numeric-words: if a string
begins with an upper-case letter, we must examine the second symbol to
recognize the type of the string (mixed or upper). In our example (Figure 3.3) we
know that all sons of the nodes ’t’, ’o’, ’h’, and ’e’ are lower-case letters.</p>
        <p>In this ordering (described by the function ord), each symbol type is given
an interval of potential orders. The function first returns the lowest order available
for a given symbol type.</p>
        <p>Each first node (son) has its imaginary left brother having default value 0.
If the syllable type is defined, it is possible to set the imaginary brother’s value
to the corresponding value of first. It lowers the distance values (and shortens
their codes) of the mentioned nodes.
</p>
        <p>
          Take the node labeled ’t’ in Figure 2 to describe the coding procedure
EncodeNode3. First we encode the number of its sons: the node has two sons, therefore
we write gamma0(2) = 011. By writing a bit 0 we denote that the dictionary does
not contain the processed word (the string “t”). The value of the first son of ’t’
is encoded as the distance between its value 3 and zero, by delta0(3 − 0) = 01100.
Then the first subtrie of node ’t’ is encoded by a recursive call of the encoding
procedure on the first son of the actual node.
        </p>
        <p>
          In the BWT step we transform the S-stream into a “better” stream, where
“better” means achieving a better final compression ratio. Obviously the
transform must be reversible, otherwise we could lose information.
Specifically, we achieve a partial grouping of identical input characters. This process
requires sorting all permutations of the step’s input. In this prototype we
do not use the efficient algorithm described in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] but the simpler C/C++ qsort
function. The use of a more sophisticated algorithm would lower the compression
time but would not affect the compression ratio. We realize that the time and space
complexity is markedly worse than in the optimal case. Since the order of
individual elements can differ from the optimal case, we do not state the time
complexity in the text; it would be confusing.
        </p>
      <p>Suppose we have a sorted matrix of all input permutations: the transform
output is then composed of its last column and of the index of the input
in this matrix (Table 2).</p>
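      <p>The naive transform described above can be sketched as follows; this quadratic sketch over an in-memory string matches the prototype’s sort-all-rotations approach rather than the efficient algorithm of [13], and the inverse is included to show reversibility.</p>
      <preformat>
```python
def bwt(seq: str) -> tuple[str, int]:
    """Naive BWT: sort all rotations of `seq`; output the last column of the
    sorted matrix plus the index of the original input among the rotations."""
    n = len(seq)
    order = sorted(range(n), key=lambda i: seq[i:] + seq[:i])
    last = "".join(seq[(i - 1) % n] for i in order)
    return last, order.index(0)

def ibwt(last: str, idx: int) -> str:
    """Inverse BWT via the standard last-to-first mapping."""
    order = sorted(range(len(last)), key=lambda i: (last[i], i))  # stable sort
    out, j = [], idx
    for _ in range(len(last)):
        j = order[j]
        out.append(last[j])
    return "".join(out)

print(bwt("banana"))         # ('nnbaaa', 3)
print(ibwt(*bwt("banana")))  # banana
```
      </preformat>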
      </sec>
      <sec id="sec-2-4">
        <title>Move to Front transform</title>
        <p>
          The output stream of BWT is then transformed by another transformation step
– MTF [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This step translates a stream of symbols into a sequence of numbers. Suppose
a numbered list of the alphabet elements; MTF reads the input elements and
writes their current list positions. As soon as an element is processed, it is moved
to the front of the list.
        </p>
        <p>Example 4.
initial alphabet: 01 02 03 04 05 06 A1 E1 EA
input string: E1 06 01 02 02 E1 04 05 EA EA A1 A1 03 03
final alphabet: 03 A1 EA 05 04 E1 02 01 06
The output of the MTF phase is “76230356808080”.</p>
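        <p>The MTF step can be sketched as follows; this minimal sketch reproduces Example 4.</p>
        <preformat>
```python
def mtf_encode(symbols, alphabet):
    """Move-to-Front: emit each symbol's current position in the list,
    then move the symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for s in symbols:
        i = table.index(s)
        out.append(i)
        table.insert(0, table.pop(i))      # move the symbol to the front
    return out, table

codes, final = mtf_encode(
    "E1 06 01 02 02 E1 04 05 EA EA A1 A1 03 03".split(),
    ["01", "02", "03", "04", "05", "06", "A1", "E1", "EA"])
print("".join(map(str, codes)))  # 76230356808080
print(final)                     # ['03', 'A1', 'EA', '05', '04', 'E1', '02', '01', '06']
```
        </preformat>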
      </sec>
      <sec id="sec-2-5">
        <title>Run Length Encoding of null sequences</title>
        <p>The MTF step may generate long sequences of zeroes (null sequences). If
successful, the RLE step shrinks these null sequences and replaces each of them with a special
symbol representing a null sequence of a given length. The final output is
a stream of numbers and the special symbols.</p>
        <p>Unfortunately, our example does not show a proper use of RLE. Therefore
we alter the problem and replace the string “02 01 00 00 00 03 00 07 00 00”
by “02 01 N3 03 00 07 N2”.</p>
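        <p>The zero-run replacement can be sketched as follows; representing the special symbol as a pair ('N', length) and keeping lone zeros literal are our choices for this sketch.</p>
        <preformat>
```python
def rle_zeros(nums):
    """Replace every run of two or more zeros by a marker ('N', run_length);
    other numbers, and single zeros, are passed through unchanged."""
    out, i = [], 0
    while i < len(nums):
        j = i
        while j < len(nums) and nums[j] == 0:
            j += 1                         # extend the current zero run
        if j - i >= 2:
            out.append(("N", j - i))       # shrink the null sequence
            i = j
        elif j > i:                        # a lone zero stays literal
            out.append(0)
            i = j
        else:
            out.append(nums[i])
            i += 1
    return out

print(rle_zeros([2, 1, 0, 0, 0, 3, 0, 7, 0, 0]))
# [2, 1, ('N', 3), 3, 0, 7, ('N', 2)]
```
        </preformat>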
      </sec>
      <sec id="sec-2-6">
        <title>Canonical Huffman</title>
        <p>
          After the RLE step the stream is encoded by a canonical Huffman code [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Huffman coding is based on assigning shorter codes to more frequent characters
than to less frequent ones. In our example, the symbols of “76230356808080”
are assigned the following codes:
Example 5.
0 – 00, 2 – 110, 3 – 010, 5 – 1110, 6 – 011, 7 – 1111, 8 – 10
The output is then: 1111 011 110 010 00 010 1110 011 10 00 10 00 10 00.
        </p>
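        <p>A canonical Huffman coder can be sketched as follows: a heap-based code-length computation plus the canonical code assignment. Tie-breaking may yield individual code lengths that differ from Example 5’s, but the total coded length (37 bits for our example) stays the same for any optimal Huffman tree.</p>
        <preformat>
```python
import heapq
from itertools import count

def code_lengths(freq):
    """Huffman code length per symbol via the usual heap construction."""
    if len(freq) == 1:
        return {s: 1 for s in freq}
    tie = count()                          # tie-breaker: never compare lists
    heap = [(f, next(tie), [s]) for s, f in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        for s in a + b:
            lengths[s] += 1                # one level deeper in the tree
        heapq.heappush(heap, (fa + fb, next(tie), a + b))
    return lengths

def canonical_codes(lengths):
    """Canonical assignment: symbols sorted by (length, symbol); each code is
    the previous code + 1, shifted left whenever the length grows."""
    code, prev, out = 0, 0, {}
    for s in sorted(lengths, key=lambda s: (lengths[s], s)):
        code <<= lengths[s] - prev
        out[s] = format(code, f"0{lengths[s]}b")
        code, prev = code + 1, lengths[s]
    return out

msg = "76230356808080"
freq = {s: msg.count(s) for s in set(msg)}
lengths = code_lengths(freq)
codes = canonical_codes(lengths)
print(sum(freq[s] * lengths[s] for s in freq))  # 37, as in Example 5
```
        </preformat>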
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We compared the compression effectiveness of BWT over letters, syllables, words,
3-grams and 5-grams. The tests were performed on commonly used text files
of different sizes (1 kB – 5 MB) in various languages: English (EN), Czech (CZ),
and German (GE).</p>
      <sec id="sec-3-1">
        <title>Testing data</title>
        <p>
          Each of tested languages (EN, CZ, GE) had its own plain text testing data.
Testing set for Czech language contained 1000 random news articles selected
from PDT [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and 69 books from eKnihy [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Testing set for English contained
1000 random juridical documents from [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and 1094 books from Gutenberg
project [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. For German, we used 184 news articles from Sueddeutsche [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and
175 books from Gutenberg project [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>
          The primary goal was to compare letter-based, syllable-based and word-based
compression. For files up to 200 kB, letter-based compression appears to
be optimal; for files of 200 kB – 5 MB, syllable-based compression is the most
effective. The language also affects the results. English has a simple morphology:
in large documents the difference between words and syllables is insignificant.
In languages with rich morphology (Czech, German), words are still about 10%
worse than syllables, even on the largest documents. The influence of the language
type on compression is detailed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
      <p>As the second aim, we compared syllable-based compression with
3-gram-based compression, and word-based compression with 5-gram-based compression.
Syllables as well as words are natural language units, therefore we expected them
to be more effective than 3-grams and 5-grams. These assumptions were
confirmed. Natural units were the most effective for small documents (by 20–30%);
as the document size increases, the advantage falls to 10–15% for documents
of size 2 – 5 MB.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we compare experimentally the compression efficiency of the
Burrows-Wheeler transform over letters, syllables, words, 3-grams and 5-grams, using the
XBW compression algorithm. We test common plain text files of different sizes
(1 kB – 5 MB) in various languages (English, Czech, German). A BWT block
contains the whole document.</p>
      <p>Comparing letter-based, syllable-based and word-based compression, we
find that character-based compression is most suitable for small files (up to
200 kB) and syllable-based compression for files of size 200 kB – 5 MB (see Table
3; the compression ratio in the table is given in bits per character (bpc)).</p>
      <p>The compression using natural text units like words or syllables is 10–30%
better than compression with 5-grams and 3-grams.</p>
      <p>
        To achieve the results introduced in this article, we used the compression algorithm
XBW, implemented mainly for testing purposes. Our next goal is to improve this
program for practical use, e.g. better time complexity (a faster sorting algorithm
in BWT [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], splay trees for MTF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) or UTF-8 encoding support.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arnavut</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Move-to-front and inversion coding</article-title>
          .
          <source>Data Compression Conference</source>
          , IEEE CS Press, Los Alamitos, CA, USA (
          <year>2000</year>
          )
          <fpage>193</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bentley</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sleator</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarjan</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>V. K.</given-names>
          </string-name>
          :
          <article-title>A locally adaptive data compression scheme</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <fpage>320</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>April 1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wheeler</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          :
          <article-title>A Block-Sorting Lossless Data Compression Algorithm</article-title>
          .
          <source>Technical report</source>
          , Digital Equipment Corporation, Palo Alto, CA, USA (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Elias</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Universal codeword sets and representation of the integers</article-title>
          .
          <source>IEEE Trans. on Information Theory</source>
          , Vol.
          <volume>21</volume>
          , (
          <year>1975</year>
          )
          <fpage>194</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luccio</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manzini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muthukrishnan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Structuring labeled trees for optimal succinctness, and beyond</article-title>
          .
          <source>Proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05)</source>
          , (
          <year>2005</year>
          )
          <fpage>184</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luccio</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manzini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muthukrishnan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Compressing and Searching XML Data Via Two Zips</article-title>
          .
          <source>Proc. WWW</source>
          <year>2006</year>
          , Edinburgh, Scotland. (
          <year>2006</year>
          )
          <fpage>751</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Galambos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lansky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chernik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Compression of Semistructured Documents</article-title>
          .
          <source>In: International Enformatika Conference IEC</source>
          <year>2006</year>
          , Enformatika, Transactions on Engineering,
          <source>Computing and Technology</source>
          , Volume
          <volume>14</volume>
          , (
          <year>2006</year>
          )
          <fpage>222</fpage>
          -
          <lpage>227</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Isal</surname>
            ,
            <given-names>R.Y.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Parsing Strategies for BWT Compression</article-title>
          . Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (
          <year>2001</year>
          )
          <fpage>429</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Isal</surname>
            ,
            <given-names>R.Y.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Word-based Block-sorting Text Compression</article-title>
          .
          <source>Proc. 24th Australasian</source>
          Computer Science Conference, Gold Coast, Australia, (
          <year>2001</year>
          )
          <fpage>92</fpage>
          -
          <lpage>99</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lansky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemlicka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Compression of a Dictionary</article-title>
          . In: Snasel,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Richta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            , and
            <surname>Pokorny</surname>
          </string-name>
          , J.:
          <source>Proceedings of the Dateso 2006</source>
          Annual International Workshop on DAtabases, TExts, Specifications and Objects.
          <source>CEUR-WS</source>
          , Vol.
          <volume>176</volume>
          , (
          <year>2006</year>
          )
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lansky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemlicka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Text Compression: Syllables. In: Richta,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Snasel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Pokorny</surname>
          </string-name>
          , J.:
          <source>Proceedings of the Dateso 2005</source>
          Annual International Workshop on DAtabases, TExts, Specifications and Objects.
          <source>CEUR-WS</source>
          , Vol.
          <volume>129</volume>
          , (
          <year>2005</year>
          )
          <fpage>32</fpage>
          -
          <lpage>45</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turpin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On the implementation of minimum redundancy prefix codes</article-title>
          .
          <source>IEEE Trans.Comm</source>
          .
          <volume>45</volume>
          (
          <year>1997</year>
          ),
          <fpage>1200</fpage>
          -
          <lpage>1207</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Seward</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On the Performance of BWT Sorting Algorithms</article-title>
          . DCC, IEEE CS Press, Los Alamitos, CA, USA (
          <year>2000</year>
          )
          <fpage>173</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Seward</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The bzip2 and libbzip2 official home page</article-title>
          . http://sources.redhat.com/bzip2/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. e-books. http://go.to/eknihy</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Project</given-names>
            <surname>Gutenberg</surname>
          </string-name>
          . http://www.promo.net/pg
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Compact Oxford English Dictionary. http://www.askoxford.com</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sueddeutsche</surname>
          </string-name>
          . http://www.sueddeutsche.de
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. California Law. http://www.leginfo.ca.gov/calaw.html</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. The Prague Dependency Treebank. http://ufal.mff.cuni.cz/pdt/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>