<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ORCID:</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Practical Word-based Text Compression Using the Reverse Multi-Delimiter Codes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anatoly V. Anisimov</string-name>
          <email>a.v.anisimov@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor O. Zavadskyi</string-name>
          <email>ihorzavadskyi@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timofey S. Chudakov</string-name>
          <email>timofey.chudakov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>4d Glushkov Ave., Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2006</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>decoding. We present the technique of word-level natural language text compression improving the performance of modern powerful achievers, such as 7z or BC-ZIP. It is based on the use of the Reverse Multi-Delimiter codes in combination with special preprocessing of a text and its dictionary. Also, we construct a very fast decoding algorithm for RMD-codes operating almost at the same speed as SCDC and times faster than Fibonacci codes Compression of large textual databases is one of the key elements of modern information retrieval systems. Text compression techniques can be divided into 2 groups: methods operating individual characters as alphabet elements and methods using words as atomic symbols. We focus on methods of the second group since they rely on partitioning texts (at least in European languages) in most natural semantic units and thus provide significantly better compression ratios. Well-known classical solutions based on entropy encoding, such as Huffman codes</p>
      </abstract>
      <kwd-group>
        <kwd>Word-based</kwd>
        <kwd>compression</kwd>
        <kwd>archiver</kwd>
        <kwd>code</kwd>
        <kwd>multi-delimiter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], can be applied to the word-level text
compression and provide compression ratios close to the theoretical limit defined by Shannon’s entropy.
However, not only the compression ratio matters but also such features as fast search in compressed
data, high decoding speed, and code robustness in the sense of limiting possible error propagation. As
is known, Huffman codes are not well suited for such requirements.
      </p>
      <p>
        An alternative approach stems from the use of variable-length codes with delimiters, such as
Fibonacci [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], (s,c)-dense [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
      </p>
      <sec id="sec-1-1">
        <title>Multi-Delimiter, or Reverse</title>
      </sec>
      <sec id="sec-1-2">
        <title>Multi-Delimiter (RMD) [4] codes.</title>
        <p>Delimiters are special bit sequences denoting the beginning or the end of a codeword. Properly chosen
delimiters guarantee that any portion of an encoded bitstream can be uniquely decoded independently
of the surrounding context. This implies such remarkable property of a code as synchronizability and
also allows us to make the fast Boyer-Moore-style pattern search in a compressed file.</p>
        <p>
          Of course, these properties are achieved at the cost of compression ratio. When we use a
characterbased alphabet, the price is high enough. However, a word-based alphabet contains rather more
elements and distribution of their frequencies is flatter, which equates the compression performance of
different codes. As shown in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], 100 MB English text file compressed by SCDC on a word alphabet
may exceed Shannon's H0 entropy by 14.5%, Fibonacci code Fib3 – by 4.4%, and RMD-codes – by
2.8%. Let us note that the mentioned compression ratios characterize codes themselves, without
auxiliary structures, such as dictionaries required for encoding and decoding. In contrast to
characterlevel, in a word-level text compression, a dictionary is a more significant part that should be stored
together with the compressed file. It can be represented as the sequence of pairs (word of a text,
codeword), where codewords are sorted in ascending order of their lengths, while words are sorted in
descending order of their frequencies. However, for monotonous codes (e.g. SCDC, Fibonacci, and
        </p>
        <p>2022 Copyright for this paper by its authors.
RMD but not MD-codes), it is enough to store only the sequence of different words of a text, since for
a given word its codeword can be easily calculated from the word number and vice versa.</p>
        <p>
          The uncompressed word-level dictionary for 1 GB English text occupies about 2.5–3% of a text
itself and about 5% of 100 MB text. If we compare the size of a compressed dictionary with the
compressed text, the percentage becomes even higher. By the other hand, a text already consists of all
dictionary elements, giving us an idea of saving the space by special marking dictionary elements in
the text. Also, some other regularities of natural language word sequences can be exploited. This leads
us to constructing a word-level text preprocessor, that is placed in front of a standard postcompressor
to achieve a higher compression ratio. We describe such preprocessor in Section 2. Although it can be
postcompressed by any monotonous variable length code, in experiments we use RMD-codes as they
provide the best compression ratio among other codes with delimiters. Notably, the described
preprocessing technique preserves the synchronizability of a code, the possibility of fast decompressing,
and, to some extent, of a compressed search. Also, it is worth to note that this approach improves not
only the performance of variable-length codes but also of known powerful archivers if we apply them
after preprocessing + RMD encoding. Moreover, using RMD-codes to encode the lengths and offsets
of segments in the LZ77 compression scheme can improve the results of high-order character-based
entropy compression, as shown in Section 6 on the example of the open-source archiver BC-ZIP [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          As mentioned above, another important property of compression codes is a decoding speed. The
(s,c)-dense codes are specially intended for fast decoding and considered a champion solution by this
parameter, if not considering the recently invented Binary Coded Ternary encoding [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and Byte Codes
with Restricted Prefix Properties [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], although the latter do not possess other properties of codes with
delimiters. In Section 3 we suggest a very fast decoding method for RMD-codes operating almost at
the same speed as SCDC decoding and even faster for short texts. It essentially improves our previous
decoding technique given in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Let us define an RMD-code. Assume  1, … ,   is the ascending
sequence of natural numbers. The codeword set of the RMD-code   1,…,  consists of codewords of
the form 01  ,  = ̅1̅,̅̅, and also codewords that:
 start from the sequence 01  0,  = ̅1̅,̅̅ and do not contain any of these sequences anywhere
else in a codeword;
 do not end with a sequence 01  ,  = ̅1̅,̅̅.
        </p>
        <p>The bit sequences of the form 01  0 can be considered as delimiters. Although such delimiters do
not belong to codewords of the form 01  , these codewords constitute the delimiters together with the
leading 0 bit of the next codeword.</p>
        <p>
          In this paper, we use only RMD-codes with the infinite number of delimiters as they demonstrate
better compression ratio [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. By   1,…,  −∞ we denote the RMD-code having the delimiters with
 1, … ,   or greater number of ones. E.g. R2,4–∞ is the code with delimiters 0110 and 01t0, where t ≥ 4,
while R2–∞ is the code with all delimiters consisting of 2 or more ones.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Text preprocessing</title>
      <p>
        As usual, in practical natural language text compression, a preprocessor is placed in front of a
standard postcompressor to achieve a higher compression ratio. The most known preprocessors, such
as WRT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], StarNT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], LIPT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and others use a character-based alphabet and exploit such
regularities as q-gram compression, treating uppercase letters as lowercase, separate compression of
typical word suffixes and roots, etc. We have built our own preprocessor for word-based alphabets. It
assumes storing the dictionary together with the encoded text, postcompressing a document with
RMDcodes, and after that, applying some powerful achievers, such as 7zip.
2.1.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Low frequency word collapsing</title>
      <p>Let us give some considerations on redundancy related to words with frequency 1. In fact, each word
we store twice: the first time in the dictionary and the second time as a codeword in the encoded text.
We could avoid storing a codeword by replacing it with an original word in the text but then additional
issues with marking the beginning and the end of a word arise. Therefore, we suggest the following
scheme:
1. Sort the dictionary part consisting of the words of frequency 1 in the order they occur in the
text.</p>
      <p>2. Encode all words of frequency 1 in the text with the same short codeword, say code1.
3. While decoding, replace each code1 with the next item from that part of the dictionary.</p>
      <p>As a result, to store each word of frequency 1 we use |word|+|code1| bits instead of
|word|+|code(word)| bits. The frequency of the codeword code1 is equal to the number of words of
frequency 1, which defines its position in the dictionary. Thus, we get |code1| ≤ |code(word)|, and the
described technique can save space. This technique can be generalized to words of frequency more than
1. During encoding, when we meet a word of frequency k&gt;1 first time, replace it by the codeword codek,
and encode the next occurrences of this word with its original codeword.
2.2.</p>
    </sec>
    <sec id="sec-4">
      <title>Word capitalization</title>
      <p>This transformation operates as follows. First, a new placeholder is added to the dictionary. This
placeholder will serve as an indication of an anomaly in the text. Then all consecutive pairs of words
are processed. If the first word in a pair ends with the dot character '.', possibly, it is the end of the
sentence. In this case, we check if the first letter of the second word is in uppercase. If so, the frequency
of the lower-cased word is compared with the frequency of its unmodified version. If the lower-cased
version has a higher frequency, the word is substituted with the lower-cased version. Otherwise, no
action is taken. Consider what happens when the second word starts from a lowercase letter. The
decoder won’t be able to make a correct decision just by looking at the word itself. Then the anomaly
placeholder is inserted before this word to make the text uniquely decodable. The first character of the
second word in the pair may be neither in lowercase nor in uppercase. If so, the encoder takes no action.</p>
      <p>During the decoding phase, we check all consecutive pairs of words in the text and determine if such
a pair is a sentence transition. If so, the second word is checked if it is the anomaly placeholder. If so,
this placeholder is removed from the text. If not, it is checked if the first character of the second word
is a lowercase letter. If it holds true, it is converted to uppercase, otherwise, no action is taken. Note
that this transformation may convert all occurrences of some word to the lowercase. Then the dictionary
may contain a word with zero frequency. This word might be removed from the dictionary later.
2.3.</p>
    </sec>
    <sec id="sec-5">
      <title>Local dictionary</title>
      <p>Consider all words of the text which follow some chosen word. This subset of the text we denote by
S and call a local dictionary, while the seed-word we call a sentinel word. The word frequency
distribution in the local dictionary differs from the global one. It is possible to decrease the global text
entropy by choosing sentinels among high-frequency text words and cleverly recoding local
dictionaries. Namely, form the set SL of size L consisting of codes of elements of S with the highest
local frequencies. After the sentinel word, the elements of SL are encoded with codewords code(0),…,
code(L–1). All remaining words from S\SL are encoded as follows: if the position of a word in the global
dictionary is less than L, a special anomaly code is inserted behind its codeword, otherwise, the
codeword is left intact. The anomaly code must be chosen depending on the frequency of the anomaly
which can be either less or greater than L. The decoding is done similarly to the encoding. For a given
codeword code(x) after the sentinel word, if x&lt;L, code(x) is mapped back. If it is equal to the anomaly
code, it’s deleted, otherwise, the code is left intact.</p>
      <p>It is possible to use several local dictionaries in parallel, one for each sentinel word. Each local
dictionary together with its sentinel word, the parameter L, and the anomaly code should be stored as a
part of a compressed binary sequence. Of course, as the frequency of a sentinel word is higher, the local
dictionary becomes more cost-efficient. And, for a given local dictionary, there exists some optimal
value of L, when its further increasing leads to many anomaly codes. Therefore, the number of local
dictionaries and the value L for each dictionary have to be balanced.
2.4.</p>
    </sec>
    <sec id="sec-6">
      <title>Using q-grams</title>
      <p>Although the difference between q-order and 0-order entropy for word-based alphabets is rather
smaller than for character-based alphabets, a special encoding of some frequent q-grams can reduce the
total bit length of a compressed file. Assume the dictionary is divided into parts corresponding to the
codewords of the same length. For RMD-codes, the lengths of codewords of each next part are one bit
longer than that of the previous part. Consider a q-gram of words and denote its frequency in the text
as l. This frequency corresponds to some codeword c. We can replace each series of codewords
corresponding to this q-gram with the codeword c.</p>
      <p>Let |c| be the length of this codeword and C be the total length of codewords it replaces. If the
dictionary is already composed, inserting a new codeword shifts the rest of the dictionary. Let q1,…,qm
be frequencies of last words in parts of the dictionary containing codewords of lengths |c|,|c|+1,…,|c|+m.
After inserting, lengths of their codewords will be increased by 1, which adds q1+…+qm extra bits to
the encoded text. Then the gain from encoding the q-gram with one codeword is l(C–|c|) – q1–…– qm–
a, where a is the size of the archived record relating to that q-gram in the dictionary. If this value is
positive, replacing of the q-gram makes sense.</p>
    </sec>
    <sec id="sec-7">
      <title>3. Fast byte-aligned decoding</title>
      <p>
        Any Reverse Multi-Delimiter code can be considered as a regular language and thus recognized by
the finite automaton. The decoding automata for codes R2–∞, R3–∞, and R2,4–∞ are given and discussed in
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, they process a text bit-by-bit, which is quite slow. The main idea of a fast decoding
algorithm is a “quantification” of a decoding automaton so that it reads bytes of a code and produces
the corresponding output numbers.
      </p>
      <p>We use the following notations. Assume we have finished processing some byte of a code. The
pointer ptr is a combination of the decoding automaton state a and the number l of already decoded bits
of the last codeword. It can be calculated as follows: ptr = a·lmax+ l, where lmax is the maximal possible
bitlength of a codeword. If we multiply ptr by 256 and add a current byte of the text, we get the index
x of lookup tables (line 3 of Algorithm 1). We utilize the following lookup tables:
 Pointers[x] – the pointer for decoding the next byte;
 Numbers[x] – a 64-bit number, which consists of four 16-bit numbers we get after decoding a
current byte;
 c[x] – the number of codewords fully decoded during processing the current byte.</p>
      <p>At first, we describe a fast decoding algorithm for RMD-codes assuming the dictionary contains no
more than 216 words (Algorithm 1). In this case, 4 sequential decoded integers can be output with one
assignment of a 64-bit value. On the other hand, it is easy to show that no more than 4 codewords can
be decoded, fully or partially, while processing one byte of a code with the shortest delimiter 011.
Combining these two facts, we get Algorithm 1 for decoding texts with a short dictionary. The loop
iterating over bytes of the code is given in lines 2–8. In lines 3 and 4 we calculate the pointer given the
current byte of the code Code[i] and the previous value of the pointer ptr. Then, in line 5, we output 4
decoded numbers, even if the actual number of outputs produced by the current byte is less. In line 6
the current output position is shifted by this actual number of outputs. This means that some elements
of the array Out may be overwritten at next iterations of the decoding loop.</p>
      <p>The
byte-aligned decoding is illustrated in Fig. 1. During decoding a current byte, the last codeword is
decoded partially (e.g., the codeword k+2 in Fig. 1). If the first decoded codeword is stored in the
element Out[k], the mentioned last codeword is stored in the element Out[k+c[x]]. Thus, the result of
its partial decoding is assigned to the variable tr in line 7. On a little endian machine, this value is added
to the first decoded number at the next iteration of the decoding loop in line 5, since bytes of a value
are loaded from memory to a processor register and vice versa in the reverse order. Execution of line 5
of Algorithm 1 on a little endian machine is illustrated in Fig. 2. Note that on a big endian machine an
extra operation of shifting the value tr to the left on 48 bits is needed in line 5.</p>
      <p>
        Note that, unlike the fast decoding algorithm for RMD-codes presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we do not analyze the
number of codewords to be decoded at the current iteration of a decoded loop. This is because the 'if'
statements used for that purpose are unpredictable (i.e. can be either 'true' or 'false' with high
probability), and executing such unpredictable 'ifs' is one of the main reasons for slowing down the
programs on modern processors. The general idea of the described above algorithm resembles the fast
decoding algorithm for the BCT-code presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, the BCT lookup table index does not
depend on the length of the already decoded part of a codeword. Therefore, that table is rather more
compact, and the BCT-code can be decoded essentially faster, though providing a worse compression
ratio. If the dictionary contains more than 216 words, line 5 of Algorithm 1 should be replaced with the
following two lines:
      </p>
      <p>Here Extranumbers is the special array with 64-bit elements containing third and fourth numbers
that may be obtained during decoding a current byte. This allows us to process texts with a dictionary
containing up to 232 elements, which covers all practically interesting cases. Let us estimate the space
overhead posed by Algorithm 1. Assume we decode R2–∞ having 3 states of the decoding automaton.
The length of a codeword in the dictionary with no more than 216 words does not exceed 24. This gives
3·24=72 possible values of ptr and 72·256=18432 possible values of x obtained in line 3. Each element
of the array Numbers occupies 8 bytes, the array Pointers – 1 byte, and the array c also 1 byte. Thus,
all lookup tables occupy 184 KB, which fits into the L2 cache on most machines. For long texts, the
size will be about twice as large, and this comprises a typical L3 cache size.</p>
    </sec>
    <sec id="sec-8">
      <title>4. Experiments</title>
      <p>We tested how the described above text preprocessing technique can improve the compression
efficiency of RMD-codes and popular archivers. The experiments were conducted for the 1GB text
from the Pizza&amp;Chilie corpus. The original text consists of 1,073,741,824 bytes, 189,528,100 words,
2,523,827 unique words. The file word-level entropy is 273,284,721 bytes, 13.535 bits per word. The
results are shown in Tables 1 and 2.</p>
      <p>In Table 1 heads of columns denote applied preprocessing actions.
(1) Encoding + Dictionary.
(2) Encoding + Dictionary + Low frequency word collapsing.
(3) Encoding + Dictionary + Low frequency word collapsing + Word capitalization.
(4) Encoding + Dictionary + Low frequency word collapsing + Word capitalization + Local
dictionary. We use local dictionaries for 60 most frequent sentinel words and each local dictionary
consists of 230 elements.</p>
      <p>(5) Encoding + Dictionary + Low frequency word collapsing + Word capitalization + Local dictionary
+ q-grams. We include only q-grams reducing the encoded file size. Usually, including only frequent
trigrams appears to be most efficient.</p>
      <p>As seen, applying local dictionaries gives the most effect, while the special encoding of q-grams
after all other transformations helps a little. Results of archiving are given in Table 2. We applied 7z,
version 16.02, 64-bit, RAR, and gzip archivers in the maximum compression mode (level 9) to the
original and RMD-encoded texts with all mentioned above preprocessing transformations.</p>
      <p>As observed, preliminary encoding with RMD-codes significantly improves RAR- or
gzipcompression ratio, by more than 10% or almost 50% respectively. LZMA-based 7z compresses the text
much better than RAR or gzip and it is recognized as one of the most powerful modern archivers.
However, even in this case RMD-codes open room for improvements. For example, the 7z-archiving
of the R3–∞-encoded text produces 0.7% smaller file than archiving this text without the
RMDpreprocessing. Let us note that preprocessing transformations preserve the synchronizability of a code
as well as the possibility of fast decoding. After all transformations, except for the low frequency word
collapsing, the fast Boyer-Moore- style pattern search also remains possible but requires, however, to
process special cases. Also, it is interesting that not archived RMD-encoded files are 32–34% smaller
than files archived with gzip. Considering the possibility of compressed search and fast decoding,
RMD-codes can be considered as a preferred format to store large textual databases compared with
gzip, which decodes texts 3–4 times slower. We measured the decoding time on the AMD Athlon
3000G processor, 32 KB of L1 cache, 512 KB of L2 cache, 4 MB of L3 cache, 16 GB RAM, and
64bit operating system Windows 10. Results are shown in Table 4 for 3 texts of different size:
 Small: The Bible, King James version, 3.83 MB, 790 018 words in total, 14 087 distinct words,</p>
      <p>SCDC(224,32).
 Middle-sized: articles randomly taken from Wikipedia (116 MB, 19 507 783 words in total, 288
179 distinct words), SCDC(175,81).
 Large: the first half of the largest file from the Pizza&amp;Chili Corpus, (512MB, 92 424 896 words in
total, 1 686 371 different words), SCDC(164,92).</p>
      <p>For comparison, we chose other known codes, that make a compressed search and/or fast decoding
possible: the Fibonacci codes Fib2 and Fib3 and the byte-aligned (s,c)-dense codes. SCDC are
parameterized, and for them we chose the code parameters providing the best compression ratio for
each text. Texts and (s,c)-code parameters are indicated below. For SCDC it is a pair (s,c).</p>
      <p>
        As seen, RMD-codes can be decoded almost as fast as SCDC and times faster than the Fibonacci
code Fib3. For the short text, the special decoding algorithm (Algorithm 1) operates even faster than
SCDC. For longer texts, Algorithm 1 can be applied only with amendment (1) and it is
4–11% slower than SCDC-decoding. Also, as seen, the R2–∞-code can be decoded 4–7% faster than
R2,4–∞. This can be explained by the larger lookup tables of the latter code (the decoding automaton for
R2,4–∞ consists of 5 states vs 3 states for R2–∞). Also, experiments on pattern matching in encoded texts
have been conducted. Both SCDC and RMD-codes allow fast pattern search of the Boyer-Moore type
in the encoded file without its decompression. Generally, the pattern search in an SCDC-file can be
performed on the byte level, while in an RMD-file we should analyze the values of bit patterns. This
means that the pattern search in an SCDC-file should be faster. However, the fastest known up-to-date
family of bit-pattern search methods [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] performs the bit-pattern search on the byte level, and only
when a candidate substring is found, the individual bits are analyzed. This allows performing the
bitpattern search at a speed comparable to the byte-pattern search.
      </p>
      <p>
        We apply one of such methods described in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to search for a pattern in the mentioned above Large
text. For comparison, the text was encoded both with SCDC and R2,4–∞-codes. RZkByte-wn pattern
matching algorithms was applied to the SCDC-code and RZkBit-wn algorithms to the RMD-code,
where k is the number of significant bits in a mask and n is the number of sliding windows. Different
values of parameters k and n have been tested to find all occurrences of encoded sequences of 2–1024
words randomly taken from the text. For each pattern length, the values (k,n) giving the best time in
average were chosen. The average times of 1000 runs of search algorithms are shown in Table 5, with
method parameters (k,n) in brackets.
      </p>
      <p>As seen, for very short patterns (2 words in length), the pattern search in the SCDC-encoded text is
6 times faster. This is because patterns of lengths &lt;16 bits cannot be searched on the byte level. For
longer patterns of some lengths (4, 64, or 128 words) SCDC-search is more than twice faster. However,
1024
1.61
(15,5)
1.67
(16,1)
there are some commonly searched pattern lengths when the RMD-search is almost on the same level
as the SCDC-search or even a little faster (8 or 16 words). As a pattern becomes longer, the difference
between performances of bit and byte pattern matching methods goes to zero, which is demonstrated
by the results for patterns of 512 and 1024 words in length.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Reverse multi-delimiter codes in LZ77 character-based compression</title>
      <p>
        The variable-length integer codes are in the heart of character-based data compression schemes. The
RMD-codes belong to this class and thus they can be used not only as a preprocessing tool for text
compression but also within character-based compression algorithms. For instance, they can encode
lengths and distances of segments in the LZ77-style compression. One of the most efficient LZ77-based
archivers for the natural language texts is BC-ZIP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Among versions of BC-ZIP, the
'bitoptimal' algorithm provides the best compression ratio. Its core idea is to optimize the LZ-parsing
considering the lengths of segment bit representations. Special variable-length codes ('SODA09' and
'Nibble4'), with different parameters for offsets and lengths of LZ-segments, have been developed and
implemented in the source code of BC-ZIP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These codes use the Elias gamma-style encoding. A
codeword is divided into 2 parts: the first part contains the number of a codeword class given in the
unary numeral system, while the second part consists of the main binary sequence. All codewords of
the same length comprise a codeword class. In gamma encoding, selecting the proper length classes
allows for achieving rather better compression ratios than for other variable-length codes.
      </p>
      <p>
        Although BC-ZIP codes parameters were selected very carefully to maximize the ratio of the
bitoptimal compression, experiments demonstrate that replacing the original BC-ZIP codes with
multidelimiter codes can better compress short texts of different nature. Table 6 shows the results of the
bitoptimal BC-ZIP-compression of different texts from the Canterbury corpus and parts of the English text
from the Pizza&amp;Chili corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Code parameters giving the best compression ratio were chosen both
for RMD- and BC-ZIP families. Segment offsets were encoded by R2–∞, R3–∞, or R2,4–∞ codes. Segment
lengths are shorter and the best choice for them is a code with the shortest codeword of length 2: R1–∞
or R1–5,7–∞.
      </p>
      <p>
        As seen, using RMD-codes in the BC-ZIP scheme instead of SODA09 or Nibble4 code improves
the compression ratio for all files from the Canterbury corpus, except the grammar.lsp. The compression
ratios for the 1MB English text are almost the same. However, when files become larger, the efficiency
of RMD-codes decreases since Elias gamma-style codes are better suited for large codeword sets.
Nonetheless, compressing short text messages is important, because even large texts are often divided
into small windows during the compression (see e.g. the Brotli compressor [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]).
      </p>
      <p>On the other hand, the bit-optimal LZ-parsing is performed times faster for 'stepped' gamma-style
codes. This is because less number of possible codeword lengths require fewer search resources.
However, BC-zipped file decompression can be done roughly at the same speed both for gamma-style
and RMD-encodings due to the fast decoding algorithm described in Section 3.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Conclusion</title>
      <p>The reverse multi-delimiter (RMD) compression codes are of special interest. They can be used as
a key element for the word-based natural language text compression as well as for the compact
representation of unbounded integer sequences. We establish a monotonous invertible mapping
between the set of natural numbers and the set of reverse multi-delimiter codewords. This mapping
implies the 'decoding-in-parts' principle, allowing us to construct a very fast byte-aligned decoding
algorithm based on lookup tables, which is comparable with the (s,c)-dense byte-aligned decoding
method. Given a good compression ratio, the RMD-codes provide an attractive point in the trade-off
between the compression ratio and the decoding speed in natural language text compression. Together
with the special word-level text preprocessing technique, the RMD-codes can serve as a preprocessing
tool improving the compression ratio of known archivers. Also, being delimiter-based, the RMD-codes
allow using the fast Boyer-Moore style direct pattern search in a compressed bitstream. The experiments
show that patterns of different lengths can be searched in SCDC- and RMD-compressed files with a
comparable speed.
7. References</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Huffman</surname>
          </string-name>
          ,
          <article-title>A method for the construction of minimum-redundancy codes</article-title>
          ,
          <source>Proc. IRE</source>
          , vol.
          <volume>40</volume>
          ,
          <year>1952</year>
          , pp.
          <fpage>1098</fpage>
          -
          <lpage>1101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Apostolico</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Fraenkel</surname>
          </string-name>
          ,
          <article-title>Robust transmission of unbounded strings using Fibonacci representations</article-title>
          ,
          <source>IEEE Trans. Inf. Theory</source>
          , vol.
          <volume>33</volume>
          ,
          <year>1987</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brisaboa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farina</surname>
          </string-name>
          , G. Navarro, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Esteller</surname>
          </string-name>
          , (s,c)
          <article-title>-dense coding: an optimized compression code for natural language text databases</article-title>
          ,
          <source>in: Proc. Symposium on String Processing and Information Retrieval, ser. LNCS, no. 2857. SVB</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>122</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Zavadskyi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Anisimov</surname>
          </string-name>
          ,
          <article-title>Reverse multi-delimiter compression codes</article-title>
          ,
          <source>in: 2020 Data Compression Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>[5] Bc-zip. Open source project</article-title>
          . URL: http://farruggia.github.io/bc-zip/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>I. Zavadskyi</surname>
          </string-name>
          ,
          <article-title>Binary-coded ternary number representation in natural language text compression</article-title>
          ,
          <source>in: 2022 Data Compression Conference</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          ,
          <article-title>Enhanced byte codes with restricted prefix properties</article-title>
          ,
          <source>in: String Processing and Information Retrieval, 12th International Conference Proceedings, ser. LNCS</source>
          , vol.
          <volume>3772</volume>
          ,
          <year>2005</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Skibinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Grabowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Deorowicz</surname>
          </string-name>
          ,
          <article-title>Revisiting dictionary-based compression</article-title>
          ,
          <source>Software, Practice &amp; Experience</source>
          , vol.
          <volume>35</volume>
          , no.
          <issue>15</issue>
          ,
          <year>2005</year>
          , pp.
          <fpage>1455</fpage>
          --
          <lpage>1476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A dictionary-based multi-corpora text compression system</article-title>
          ,
          <source>in: 2003 IEEE Data Compression Conference</source>
          ,
          <year>2003</year>
          , p.
          <fpage>448</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Awan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>Lipt: A lossless text transform to improve compression</article-title>
          ,
          <source>in: 2001 IEEE Inf. Technology: Coding and Computing Conference</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>452</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>I. Zavadskyi</surname>
          </string-name>
          ,
          <article-title>Fast exact pattern matching by the means of a character bit representation</article-title>
          ,
          <source>Springer Nature Computer Science</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>181</issue>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          --
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Nitto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          ,
          <article-title>On the bit-complexity of Lempel-Ziv compression</article-title>
          ,
          <source>SIAM Journal on Computing (SICOMP)</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>4</issue>
          ,
          <issue>2013</issue>
          , pp.
          <fpage>1521</fpage>
          -
          <lpage>1541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Frangioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Farruggia</surname>
          </string-name>
          ,
          <article-title>Bicriteria data compression</article-title>
          ,
          <source>ch. 115</source>
          , pp.
          <fpage>1582</fpage>
          -
          <lpage>1595</lpage>
          . URL: http://epubs.siam.org/doi/abs/10.1137/1.9781611973402.115.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>Pizza&amp;chili corpus - English texts</article-title>
          , URL: http://pizzachili.dcc.uchile.cl/texts/nlang/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Alakuijala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farruggia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          , E. Kliuchnikov,
          <string-name>
            <given-names>R.</given-names>
            <surname>Obryk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Szabadka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Vandevenne</surname>
          </string-name>
          ,
          <article-title>Brotli: A general-purpose data compressor</article-title>
          ,
          <source>ACM Transactions Information Systems</source>
          , vol.
          <volume>37</volume>
          , no.
          <issue>1</issue>
          ,
          <issue>2019</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>