<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improvement of Text Compression Parameters Using Cluster Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Dvorský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Martinovič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, VŠB - Technical University of Ostrava</institution>
          ,
          <addr-line>17. listopadu 15, 708 33 Ostrava</addr-line>
        </aff>
      </contrib-group>
      <fpage>115</fpage>
      <lpage>126</lpage>
      <abstract>
        <p>Several actions are usually performed when a document is appended to the textual database in an information retrieval system. The most frequent actions are compression of the document and cluster analysis of the textual database to improve the quality of answers to users' queries. The information retrieved from the clustering can be very helpful in compression. Word-based compression using information about the cluster hierarchy is presented in this paper. Some experimental results are provided at the end of the paper.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The modern information society produces immense quantities of textual
information. Storing text effectively and searching for necessary information in stored texts
are the tasks of Information Retrieval Systems (IRS). Information retrieval
systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] constitute a class of program tools for processing, storing and selecting
data that are texts. An IRS is accessed by a user who needs to obtain certain
information (document) from this system to solve a problem. Such information
is called relevant. Various documents are suitable to users to varying extents.
Therefore, we also speak of a document relevancy ratio. When searching for
information in an IRS, a system user submits his or her requirement, a query, and
awaits a result in the form of a set of documents selected by the system as
documents matching the user requirement, i.e. matching the user’s query.
      </p>
      <p>It is clear that the size of an IRS increases with the increasing size of available
external memories. The information explosion can be avoided basically in two
ways:
1. Extensively - by purchasing higher capacity memories, or
2. Intensively - by storing data in memories in a better way.</p>
      <p>The first solution is not interesting in terms of research. The key to the
second solution is data compression. The database of a typical IRS is a textual
database, which stores all information that is necessary for the function of the
IRS. Textual databases typically consist of the following three parts:
– full texts from documents that form a document collection
– data structures for searching documents
– a list of document identifiers and their attributes, and other auxiliary
structures</p>
      <p>
        Haskin claims in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] that the size of textual database auxiliary structures
makes up 50% to 300% of the size of the original documents. This implies that a
textual database is suitable material for compression; it is only necessary to
use appropriate lossless compression methods to save space.
      </p>
      <p>However, the problem of compression in IRS is not as simple as it seems at
first sight. On the one hand, compression saves space for data, while, on the other
hand, it may entail a certain operation overhead (i.e. adding certain amount of
time to the cost of accessing the data). Also, the space saving must be significant
to be useful. Therefore, the objective is not to compress the textual database as
a whole. This usually does not lead to good results since individual parts of an
IRS contain redundancies of different types; different data structure types are
based on different models, according to which it is possible to determine the
best compression method.</p>
      <p>Experience shows that it is useful to consider, analyze and design the best
compression method when storing extensive textual databases. It also proves to
be desirable to study highly specialized compression methods that are convenient
only for a certain data type in an IRS. Tens of megabytes can be saved either by
saving one byte in data structures or by improving the compression ratio of text
compression in an IRS.</p>
      <p>
        This paper will focus on text compression methods suitable for IRS. Factors
significantly influencing compression methods that are suitable for IRS include:
compression ratio, decompression speed, the possibility of decompressing the
document, and connection with searching [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>The aim of this paper is to extend existing word-based compression methods
with hierarchical agglomerative clustering to improve compression performance,
especially the compression ratio.</p>
      <p>The paper is organized as follows. Sections 2 and 3 provide an outline of
word-based compression methods. Section 4 briefly describes clustering methods.
A characterization of our approach is provided in section 5. Experimental results
are discussed in section 6. A short conclusion is provided in section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Word-based compression</title>
      <p>The compression algorithm transforms input data that contains a certain
redundancy to output data, in which redundancy is reduced to a minimum. The
input and the output of data compression algorithms are generally strings of
characters over a certain alphabet. There are no requirements concerning the
alphabet. The selection of the alphabet is therefore a question of choice, which
is influenced by various perspectives.</p>
      <p>
        The search for a suitable alphabet will proceed from the syntactical structure
of natural languages: character → syllable → word → sentence. Pertaining to
this structure a certain correspondence can be discovered between the language
structure and possible alphabets. Each level represents one potential alphabet.
The first level is represented by a character-based alphabet. The next possible
level is an alphabet of syllables. Here we face the problem of identifying syllables
in the text [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Much greater possibilities are offered by the third level – the
word-based alphabet.
      </p>
      <p>A compression method based on an alphabet of words, which will be called
a word-based compression method, regards text as a sequence of words1 in a
certain language. Statistical compression methods then exploit the irregular
distribution of individual word occurrence probabilities, while dictionary
methods exploit the clustering of words into syntactical structures of the
language. It is presupposed that the language structure governs not only
characters but also words. Here are some examples:
– fixed phrases, e.g. "How do you do?"
– constructions based on grammar, e.g. constructions with an article - "the
best", a phrasal verb - "to be interested in"
– constructions based on the contents of the text, e.g. "data compression",
"word based" are frequently repeated in this paper</p>
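      <p>The lexical analysis implied above can be illustrated with a minimal sketch (the exact splitting rule is an assumption for illustration, not the authors' precise token definition):
```python
import re

def tokenize(text):
    # Split text into alternating words and nonwords. A word is a maximal
    # run of letters/digits; a nonword is everything between two words
    # (spaces, punctuation). The character classes here are illustrative.
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

# Words and nonwords alternate, so the original text can be rebuilt exactly:
assert "".join(tokenize("How do you do?")) == "How do you do?"
```
Because the token stream round-trips to the original text, a word-based compressor can restore the document byte for byte.
      </p>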
      <p>It is also presupposed that these constructions are repeated and that it is
possible to achieve a certain compression on the basis of this repetition. It is not
presupposed that the text consists only of hapax legomena2 – even though this
assumption can be used as well.</p>
    </sec>
    <sec id="sec-3">
      <title>Word-based compression methods</title>
      <p>
        The first widely accessible description is that of Bentley et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (see also Ryabko
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]), who proposed that a dictionary of words parsed from the text should
be coupled with codewords that correspond to MTF numbers. Moffat [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] also
experimented with word-based models, and showed that for a range of data files
the MTF transformation was less effective than a straightforward entropy coder –
in those experiments, arithmetic coding. A similar word-based model is available
as part of the arithmetic coding implementation of Moffat et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
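      <p>The word-level MTF idea can be sketched as follows (the handling of new words is simplified here; the actual codeword assignment of Bentley et al. differs in details):
```python
def mtf_encode(tokens):
    # Move-to-front over a dynamic word list: recently seen words sit near
    # the front of the list and therefore receive small numbers.
    word_list, output = [], []
    for tok in tokens:
        if tok in word_list:
            rank = word_list.index(tok)
            word_list.pop(rank)
        else:
            # New word: emit a rank equal to the current list size,
            # followed (conceptually) by the word itself.
            rank = len(word_list)
        output.append(rank)
        word_list.insert(0, tok)
    return output
```
Repeated words yield runs of small numbers, which a subsequent entropy coder can exploit.
      </p>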
      <p>
        Moffat [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] also investigated first-order and second-order word-based models,
in which one or two words are used to condition the probability distribution used
by the entropy coding stage (respectively). Other authors have made use of
word-based models since then, including Horspool and Cormack [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], Zobel and Moffat
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], and Moffat et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. de Moura et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have extended the idea of
word-based compression to what is called the spaceless words approach, which can be
considered a special case of the eliminating of a victim [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
1 Sequences of spaces and punctuation between two words are called nonwords. HTML or
XML tags are considered the third part of the word-based alphabet in the case of
compression of HTML/XML documents. Words, nonwords and tags are called tokens
in general.
2 Hapax legomenon – a word with only one occurrence in the examined text.
      </p>
      <p>
        The HuffWord compression method was designed by Moffat and Zobel in 1994
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. HuffWord is a compression method that is specialized for texts and uses a
word-based alphabet. The compression is based on so-called canonical Huffman
coding. The authors of HuffWord claim a compression ratio of about 30%.
      </p>
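      <p>Word-based Huffman coding can be sketched as follows (this computes only the code lengths over a word alphabet; the canonical code assignment that HuffWord relies on for fast decoding is omitted):
```python
import heapq
from collections import Counter

def huffman_code_lengths(tokens):
    # Build Huffman code lengths for a word-based alphabet.
    freq = Counter(tokens)
    if len(freq) == 1:
        return {word: 1 for word in freq}
    # Heap items: (weight, tiebreaker, {word: code length so far}).
    heap = [(w, i, {word: 0}) for i, (word, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees makes every code inside them one bit longer.
        merged = {word: length + 1 for word, length in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]
```
Frequent words receive short codes, which is what makes a word-based alphabet pay off on natural-language text.
      </p>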
      <sec id="sec-3-1">
        <title>WLZW and WBW methods</title>
        <p>
          The beginning of the WLZW method dates back to 1998 when its first variant
and the first results were published [
          <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
          ]. Other modifications and results can be
found in [
          <xref ref-type="bibr" rid="ref10 ref11 ref13">10, 13, 11</xref>
          ]. The WBW method is newer and its beginning dates back
to 2001 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>Scheme of WLZW and WBW compression algorithms. Figure 1(a) shows
a schematic structure of the compression algorithms WLZW and WBW. The figure
clearly shows that both compression methods can be roughly divided into two
parts that are named front end and back end. Both compression methods process
document texts in two passes. The division of compression methods into two
parts corresponds with those passes. Some parts are not active at all in the
individual passes or their activity is different. Two algorithm phases can be
distinguished according to the order of passing through document texts in both
compression algorithms:
– First phase - corresponds to the first pass of the compression algorithm. A
word-based alphabet is created in this phase. Individual tokens are extracted
from documents through the process of lexical analysis, which is implemented
by the front end part. This phase is shared with document indexing in the
textual database.
– Second phase - corresponds to the second pass of the compression algorithm.</p>
        <p>A complete word-based alphabet is available upon the completion of the first
phase and the actual document compression can begin. A lexical analysis is
again performed and the token sequence that is being created is compressed
by a chosen algorithm. Both parts of the compression algorithm, the front
end and the back end, are already active in this phase.</p>
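        <p>The two phases above can be sketched as a driver loop (the function parameters are illustrative placeholders, not the authors' API):
```python
def compress_two_pass(documents, tokenize, backend_compress):
    # Pass 1, front end only: lexical analysis builds the word-based alphabet.
    alphabet = {}
    for doc in documents:
        for tok in tokenize(doc):
            alphabet.setdefault(tok, len(alphabet))
    # Pass 2, front end and back end: tokens are mapped to identifiers
    # and the identifier sequence is handed to the back-end coder.
    compressed = [backend_compress([alphabet[tok] for tok in tokenize(doc)])
                  for doc in documents]
    return alphabet, compressed
```
Separating the two passes mirrors the front-end/back-end split: the alphabet from pass 1 can be shared with document indexing.
        </p>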
        <p>The division of the compression algorithm into two relatively independent
parts made it possible to separate two different compression algorithm phases, i.e.
the creation of a word-based alphabet and the actual compression. Naturally, this
separation has simplified the algorithm design, it has made the implementation
more transparent, etc.</p>
        <p>Scheme of WLZW and WBW decompression algorithms. Figure 1(b)
shows the structure of the decompression algorithm of the WLZW and WBW methods.
As the figure clearly shows, both decompression algorithms can be divided into
two parts - front end and back end, like the compression algorithms. The WLZW and
WBW methods were constructed as asymmetric, from which it follows that:</p>
        <p>[Figure 1: (a) Compressor – documents pass through the front end (lexical
analyzer, token table, preprocessor), which turns tokens into identifiers; the
back end encodes the identifiers with the LZW, BW, or BSTW compressor, producing
the compressed documents together with the alphabet and auxiliary data.
(b) Decompressor – the compressed documents, alphabet and auxiliary data are
decoded by the back end (LZW, BW, or BSTW decompressor), and the front end
(token table, postprocessor) reconstructs the decompressed document; the IRS
searching algorithm supplies the DocID.]</p>
        <p>– Decompression is simpler than compression. All operations that can be
performed by the compression algorithm are transferred to that algorithm, so
that the decompression algorithm performs only the necessary operations.
– Decompression involves only one phase. The decompression of a document
requires only one pass through the compressed text. All objects contained in
scheme 1(b) are therefore active during decompression, and the
decompression process has a "through-flow" character.</p>
        <p>The division of the decompression algorithm into two parts enabled the
separation of the actual decompression and the subsequent reconstruction of the
document text from a sequence of token identifiers.</p>
        <p>Furthermore, this division made it possible to see the compression and the
decompression algorithms as two dual algorithms with a similar internal
structure.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Cluster analysis</title>
      <p>
        Finding groups of objects with the same or similar features within a given set
of objects is the goal of cluster analysis [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These groups are called clusters. In
our case the objects are documents that will be stored in the textual database, and
the clusters are groups of similar documents. First of all, the distance between two
documents and the distance matrix C for each pair of documents should be defined.
Our approach to cluster analysis is based on the ultrametric tree [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Ultrametric</title>
        <p>
          The triangular inequality holds in a metric space: d(x, z) ≤ d(x, y) + d(y, z) for
any triplet of points x, y, z. In addition, the properties of symmetry and positive
definiteness are respected. The ultrametric inequality is: d(x, z) ≤ max{d(x, y), d(y, z)}
for any triplet x, y, z [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>
          Definition 1. A metric space (X, d) is called ultrametric if for all x, y, z ∈ X
we have d(x, z) ≤ max{d(x, y), d(y, z)} [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Definition 2. The ball on the ultrametric is BX (x, r) = {z ∈ X|d(x, z) ≤ r}
for a point x ∈ X and r ≥ 0.</p>
        <p>An ultrametric tree is a rooted tree whose edges are weighted by a
nonnegative number such that the lengths of all the root-to-leaf paths, measured by
summing the weights of the edges, are equal. A distance matrix C is ultrametric
if an ultrametric tree can be constructed from this matrix. Figure 2 shows an
example of an ultrametric matrix and an ultrametric tree constructed from this
matrix.</p>
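        <p>Whether a given distance matrix is ultrametric can be checked directly from Definition 1 (a brute-force sketch):
```python
def is_ultrametric(C):
    # Check the ultrametric inequality max(d(x, y), d(y, z)) >= d(x, z)
    # for every triplet of indices in a symmetric distance matrix C.
    n = len(C)
    return all(max(C[x][y], C[y][z]) >= C[x][z]
               for x in range(n) for y in range(n) for z in range(n))
```
The check is cubic in the number of objects, so it is useful for validation rather than for large collections.
        </p>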
        <p>
          As is well known, in clustering a bijection is defined between a rooted, binary,
ranked, indexed tree, called a dendrogram, and a set of ultrametric distances [
          <xref ref-type="bibr" rid="ref17 ref23">23,
17</xref>
          ].
The hierarchical agglomerative clustering method, following [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
which computes an ultrametric tree from the distance matrix C, can be described as:
1. Create the distance matrix C.
2. At the beginning each object is considered one cluster, i.e. there are as
many clusters as objects. Sequentially, clusters are joined together and the
number of clusters drops, until finally there is one cluster.
3. The most similar clusters p and q are found in the distance matrix and their
distance is determined as proxs[p, q].
4. Clusters p and q are joined together and the number of clusters is reduced.
        </p>
        <p>The newly formed cluster is denoted t (it replaces row and column q), and
the distances proxs[t, r] of the new cluster t to all other clusters r are computed.
For the Average method proxs[t, r] is defined as:
proxs[t, r] = (Np · proxs[p, r] + Nq · proxs[q, r]) / (Np + Nq)
Then row and column p, corresponding to cluster p, are deleted from the
distance matrix C, i.e. the size of the distance matrix is reduced.
5. If there is more than one cluster, go to step 3.</p>
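        <p>The steps above can be sketched as follows (index bookkeeping replaces the actual deletion of rows and columns, and the merge history is returned instead of the tree itself):
```python
def average_linkage(C):
    clusters = {i: 1 for i in range(len(C))}   # cluster id -> size N
    prox = {(i, j): C[i][j] for i in clusters for j in clusters if j > i}
    merges = []
    while len(clusters) > 1:
        # Step 3: find the closest pair p, q.
        p, q = min(prox, key=prox.get)
        Np, Nq = clusters[p], clusters[q]
        merges.append((p, q, prox[(p, q)]))
        # Step 4: join p and q into a new cluster t.
        t = max(clusters) + 1
        del clusters[p], clusters[q]
        for r in clusters:
            d_pr = prox.pop((min(p, r), max(p, r)))
            d_qr = prox.pop((min(q, r), max(q, r)))
            # Average rule: proxs[t, r] = (Np*d_pr + Nq*d_qr) / (Np + Nq)
            prox[(r, t)] = (Np * d_pr + Nq * d_qr) / (Np + Nq)
        del prox[(p, q)]
        clusters[t] = Np + Nq
    return merges
```
Each entry of the returned history records which clusters were joined and at what distance, which is exactly the information encoded by the ultrametric tree.
        </p>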
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Compression with clustering support</title>
      <p>The ordering of input documents was not taken into consideration in the general
description of word-based compression methods. The compression method works
correctly for any ordering of documents. Probably the simplest ordering of input
documents is time ordering, i.e. the documents are compressed in the same
order as they are added to the textual database. Seeing that the compression methods
are based on searching for repeated parts of texts, it is easy to see that this
ordering is not necessarily the best possible. An improvement in compression
performance can be achieved by reordering the input documents. A better ordering of
input documents moves similar documents close to one another.</p>
      <p>
        Similar documents are grouped together using cluster analysis (section 4). Of course,
cluster analysis is very time consuming, so it would be counterproductive to perform
the analysis only to enhance compression performance. But when a compression
method for an IRS is developed, the results of cluster analysis can be used in query
processing [
        <xref ref-type="bibr" rid="ref19 ref8">8, 19</xref>
        ] and vice versa; cluster analysis originally devoted to query
processing can be incorporated into compression.
      </p>
      <p>To group similar documents together, the agglomerative clustering algorithm
described in section 4 was used. But the question of how to convert the hierarchical
tree structure of clusters into a linear list of documents still remains. The ultrametric
tree was created during clustering. We can use this fact: the list of documents LX
for compression is obtained by the ultrametric ball query BX (x, r), where r is the
maximal distance in the ultrametric tree. LX is sorted before compression by the
distance d(x, z), where z ∈ LX .</p>
      <p>Two strategies were used to reorder the collection of documents entering the
compression process:
Most Similar Left (MSL) – x in BX (x, r) is the leftmost document in the
ultrametric tree.</p>
      <p>Most Similar Right (MSR) – x in BX (x, r) is the rightmost document in the
ultrametric tree.</p>
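      <p>The conversion from the cluster hierarchy to a linear document list can be sketched as a depth-first traversal of the dendrogram (the merge-history format, the numbering of new clusters, and the function name are assumptions for illustration; the two strategies correspond to the two possible traversal orders of each joined pair):
```python
def leaf_order(merges, n):
    # merges: pairs of joined cluster ids over n documents, with new
    # clusters numbered n, n+1, ... in merge order.
    children = {n + i: pair for i, pair in enumerate(merges)}
    def leaves(c):
        if n > c:               # c is an original document
            return [c]
        left, right = children[c]
        # Visiting the left subtree first gives one ordering; swapping
        # the two recursive calls gives the mirror ordering.
        return leaves(left) + leaves(right)
    return leaves(n + len(merges) - 1)   # the last merge is the root
```
Documents that were joined early in the clustering end up adjacent in the resulting list, which is what the compressor exploits.
      </p>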
    </sec>
    <sec id="sec-6">
      <title>Experimental Results</title>
      <p>
        Some experiments were done to test the impact of clustering on word-based
compression methods. Both compression methods were used in our tests. Two large text
files were used for our tests: latimes.txt, coming from the TREC corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and
enron.txt, which consists of emails from the Enron email corpus3. In the file latimes.txt
individual documents are represented by newspaper articles and the ordering is
determined by the date of publication. Each individual email represents a document in
the file enron.txt, and the ordering is defined by the alphabetical ordering of users in
the Enron corpus. Results for these orderings, without clustering, are provided in Table 1.
      </p>
      <p>The following notation will be used to describe the results of the experiments:
– CS is the size of the compressed file
– CSα, where α ∈ {WLZW, WBW, GZIP, BZIP2}, is the size of the compressed
file without clustering, see Table 1
– ΔCS = (CSα − CS) / CSα × 100%
– CR = CS / S0 × 100% is the compression ratio
– ΔCR = CRα − CR, where α ∈ {WLZW, WBW, GZIP, BZIP2}
Δ values represent the difference between a given value and the corresponding value
for compression without clustering. With these definitions, a positive Δ value means
that compression with clustering performs better than the original, and a negative
value means that it performs worse.</p>
      <p>Table 1. Compression without clustering:
                                          latimes.txt    enron.txt
Original size S0 [bytes]                  498,360,166  886,993,953
WLZW   compressed size CSWLZW [bytes]     158,017,940  207,908,560
WLZW   compression ratio CRWLZW [%]            31.708       23.440
WBW    compressed size CSWBW [bytes]      110,246,524  167,099,129
WBW    compression ratio CRWBW [%]             22.122       18.839
Gzip   compressed size CSGZIP [bytes]     175,864,812  228,953,895
Gzip   compression ratio CRGZIP [%]            35.289       25.812
BZip2  compressed size CSBZIP2 [bytes]    131,371,338  164,720,382
BZip2  compression ratio CRBZIP2 [%]           26.361       18.571</p>
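      <p>The notation above can be computed directly (with these formulas, a positive Δ corresponds to a smaller compressed size with clustering):
```python
def compression_metrics(S0, CS, CS_alpha):
    # S0: original size; CS: compressed size with clustering;
    # CS_alpha: compressed size without clustering.
    CR = CS / S0 * 100.0                        # compression ratio [%]
    CR_alpha = CS_alpha / S0 * 100.0
    delta_CS = (CS_alpha - CS) / CS_alpha * 100.0
    delta_CR = CR_alpha - CR
    return CR, delta_CS, delta_CR
```
      </p>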
      <p>The first experiments are focused on the file latimes.txt. This file is relatively
large. The size of the documents (newspaper articles) varies from two to eight
kilobytes. Compression with clustering and five random permutations were tested.
3 Duplicate emails were deleted before processing.</p>
      <p>It is easy to see from Table 2 that clustering brings positive results in terms
of compression ratio. The size of the compressed text is about 4% less than the
original size with the WLZW method, and about 5% smaller than the original one
with the WBW method. The compression ratio improves by about 1.2% with respect
to the original values in both cases.
Better results were achieved for the file enron.txt, see Table 2. The improvement
of the compression ratio is about 2% with respect to the original compressed size
with the WLZW method, and about 4% with the WBW method.</p>
      <p>Random permutations deteriorate compression in all cases (see Tables 3
and 4). These negative results mean that clustering has a measurable impact on
compression performance, and the positive results regarding cluster-supported
compression are not coincidental.</p>
      <p>The results of the standard GZip and BZip2 compression utilities provide data
for comparison with our proposed word-based compression methods. As can be
seen from the tables, the character of these results is very close to that of our
methods; therefore clustering has a significant impact on compression regardless
of the selected compression method.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and future works</title>
      <p>Word-based compression methods combined with cluster analysis of the input
documents have been presented in this paper. These compression methods are suitable
especially for IRS. Experimental results prove that clustering has a positive
impact on the compression ratio.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>This work is supported by grant No. 201/05/P145 of the Grant Agency of the Czech Republic.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Modern Information Retrieval</article-title>
          . Addison Wesley, New York,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Bentley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Sleator</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Tarjan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Wei</surname>
          </string-name>
          .
          <article-title>A locally adaptive data compression scheme</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <fpage>320</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Berry</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Browne</surname>
          </string-name>
          .
          <source>Understanding Search Engines. SIAM Society for Industrial and Applied Mathematics</source>
          , Philadelphia,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>N.</given-names>
            <surname>Brodskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dydak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Higes</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          .
          <article-title>Dimension zero at all scales</article-title>
          .
          <source>ArXiv Mathematics</source>
          e-prints,
          <year>July 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. E. S. de Moura, G. Navarro,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          .
          <article-title>Fast searching on compressed text allowing errors</article-title>
          . In W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors,
          <source>Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>298</fpage>
          -
          <lpage>306</lpage>
          . ACM Press, New York,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. J. Dvorský.
          <article-title>Word-based compression methods for text retrieval systems</article-title>
          .
          <source>In WDS'98</source>
          ,
          <string-name>
            <surname>Praha</surname>
          </string-name>
          ,
          <year>1998</year>
          . ISBN 80-85863-29-4.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. J. Dvorský.
          <article-title>Word-based Compression Methods for Information Retrieval Systems</article-title>
          .
          <source>Phd thesis</source>
          , Charles University Prague,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. J. Dvorský,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinovic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          <article-title>Query expansion and evolution of topic in information retrieval systems</article-title>
          . In V. Snášel, J. Pokorný, and K. Richta, editors,
          <source>DATESO</source>
          , volume
          <volume>98</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <fpage>117</fpage>
          -
          <lpage>127</lpage>
          . CEURWS.org,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. J. Dvorský, J. Pokorný, and
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          <article-title>Word-based compression methods with empty words and nonwords for text retrieval systems</article-title>
          . In Datasem'
          <volume>98</volume>
          ,
          <string-name>
            <surname>Brno</surname>
          </string-name>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. J. Dvorský, J. Pokorný, and
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          <article-title>Compression methods for large multilingual text document</article-title>
          .
          <source>In Proceedings of 1999 International Symposium on Database, Web and Cooperative Systems</source>
          , pages
          <fpage>158</fpage>
          -
          <lpage>163</lpage>
          , Baden-Baden,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. J. Dvorský, J. Pokorný, and
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          .
          <article-title>Word-based compression methods and indexing for text retrieval systems</article-title>
          . In J. Eder,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rozman</surname>
          </string-name>
          , and T. Welzer, editors,
          <source>Proceedings of ADBIS 99, number 1691 in Lecture Notes in Computer Science</source>
          , pages
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          . Springer-Verlag, Berlin,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Dvorský</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          .
          <article-title>Modifications in Burrows-Wheeler compression algorithm</article-title>
          . In
          <source>Proceedings of ISM 2001</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>35</lpage>
          , Ostrava,
          <year>2001</year>
          . ISBN 80-85988-51-8.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. J. Dvorský,
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorný</surname>
          </string-name>
          .
          <article-title>Word-based compression methods for large text documents</article-title>
          . In
          <source>Data Compression Conference - DCC '99</source>
          , page
          <fpage>523</fpage>
          , Snowbird, Utah, USA,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. D. Harman, editor.
          <source>The Fourth Text REtrieval Conference (TREC-4)</source>
          .
          <source>National Inst. of Standards and Technology</source>
          , Gaithersburg, USA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Haskin</surname>
          </string-name>
          .
          <article-title>Special-purpose processors for text retrieval</article-title>
          .
          <source>Database Engineering</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>16</fpage>
          -
          <lpage>29</lpage>
          ,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Horspool</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          .
          <article-title>Constructing word-based text compression algorithms</article-title>
          . In J. A. Storer and M. Cohn, editors,
          <source>Proc. 1992 IEEE Data Compression Conference</source>
          , pages
          <fpage>62</fpage>
          -
          <lpage>71</lpage>
          . IEEE Computer Society Press, Los Alamitos, California, Mar.
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <article-title>Hierarchical clustering schemes</article-title>
          .
          <source>Psychometrika</source>
          ,
          <volume>32</volume>
          :
          <fpage>241</fpage>
          -
          <lpage>254</lpage>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lánský</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Žemlička</surname>
          </string-name>
          .
          <article-title>Text compression: Syllables</article-title>
          . In Richta et al. [
          <volume>24</volume>
          ], pages
          <fpage>32</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinovič</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Gajdoš</surname>
          </string-name>
          .
          <article-title>Vector model improvement by FCA and topic evolution</article-title>
          . In Richta et al. [
          <volume>24</volume>
          ], pages
          <fpage>46</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Word-based text compression</article-title>
          .
          <source>Software-Practice and Experience</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>185</fpage>
          -
          <lpage>198</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Neal</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>Arithmetic coding revisited</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>256</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Sharman</surname>
          </string-name>
          .
          <article-title>Text compression for dynamic document databases</article-title>
          .
          <source>Knowledge and Data Engineering</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <fpage>302</fpage>
          -
          <lpage>313</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          .
          <article-title>On ultrametricity, data coding, and computation</article-title>
          .
          <source>Journal of Classification</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>167</fpage>
          -
          <lpage>184</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>K.</given-names>
            <surname>Richta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Snášel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorný</surname>
          </string-name>
          , editors.
          <source>Proceedings of the Dateso 2005 Annual International Workshop on DAtabases, TExts, Specifications and Objects</source>
          , Desná, Czech Republic, April 13-15,
          <year>2005</year>
          , volume
          <volume>129</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          . CEUR-WS.org
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Ryabko</surname>
          </string-name>
          .
          <article-title>Technical correspondence: A locally adaptive data compression scheme</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>30</volume>
          (
          <issue>9</issue>
          ):
          <fpage>792</fpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>I.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <source>Managing Gigabytes: Compressing and Indexing Documents and Images</source>
          . Van Nostrand Reinhold,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Adding compression to a full-text retrieval system</article-title>
          .
          <source>Software-Practice and Experience</source>
          ,
          <volume>25</volume>
          (
          <issue>8</issue>
          ):
          <fpage>891</fpage>
          -
          <lpage>903</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>