<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksii Turuta</string-name>
          <email>oleksii.turuta@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliia Golian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vira Golian</string-name>
          <email>vira.golan@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Afanasieva</string-name>
          <email>iryna.afanasieva@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostiantyn Onyshchenko</string-name>
          <email>kostiantyn.onyshchenko@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sviatoslav Mychka</string-name>
          <email>sviatoslav.mychka@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>14 Nauky Ave., Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Text data compression methods are explored. The advantages and disadvantages of different combinations of arithmetic and Huffman coding as entropy methods and Snappy and LZ4 as dictionary methods are described, and methods for increasing the efficiency of entropy compression are suggested.</p>
      </abstract>
      <kwd-group>
<kwd>Text compression</kwd>
        <kwd>encoding</kwd>
        <kwd>algorithm</kwd>
        <kwd>arithmetic coding</kwd>
        <kwd>Huffman optimal coding</kwd>
        <kwd>Snappy</kwd>
<kwd>LZ4</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Data compression algorithms are divided into two groups: lossy and lossless. According to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
the feature-based data compression concept is becoming more popular nowadays. This concept unifies
lossless, near-lossless and lossy compression with the idea of features – pieces of information that
possess high discriminative or predictive value for human interpretation or machine
processing.
      </p>
      <p>
        In this paper specific lossless compression algorithms are investigated. They can be divided into
entropy and dictionary types, depending on which approach to data compression they use. Entropy
encoders usually work with data on the bit level, attempting to reduce the number of bits needed to
encode a message, while the dictionary ones usually work with bytes, writing them into a
dictionary-like structure to reference parts of the text later. The dictionary encoders are often
called the “LZ family” due to similarity to LZ77, the first algorithm that employed this
approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Data compression algorithms are qualified by compression ratio, compression speed and
decompression speed. In their turn, these metrics depend on the data's size, the message's mean
entropy and, in the case of entropy algorithms, the size of the alphabet used in the message. The
mean entropy of the message is calculated using Shannon entropy [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] (1).
      </p>
      <p>H(X) = −Σ p(x) log<sub>2</sub> p(x), (1)
where p(x) – frequency of occurrence of symbol x.</p>
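      <p>For illustration, formula (1) can be computed for a byte message with a few lines of C#
(a minimal sketch, separate from the compressors implemented in Section 4):</p>
      <p>// Mean Shannon entropy, in bits per symbol, of a byte message, per formula (1).
static double ShannonEntropy(byte[] message)
{
    var counts = new int[256];
    foreach (var b in message) counts[b]++;
    double entropy = 0.0;
    foreach (var count in counts)
    {
        if (count == 0) continue;
        var p = (double)count / message.Length; // p(x), frequency of occurrence
        entropy -= p * Math.Log2(p);
    }
    return entropy;
}</p>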
      <p>
        An alphabet is a set of symbols used in a message [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the case of byte-to-byte encoding of
binary data, the alphabet contains from 1 to 256 symbols, since bytes take values from 0 to
255 and a non-empty message uses at least one of them. Static alphabets can be used to create models for
entropy compression algorithms, e.g., adaptive arithmetic coding.
      </p>
      <p>Shannon’s source coding theorem implies that N i.i.d. random variables, each with entropy H(X), can be
compressed into slightly more than N·H(X) bits with negligible risk of information loss as N → ∞;
conversely, if they are compressed into fewer than N·H(X) bits, it is virtually certain that
information will be lost. When the symbols in the alphabet of a binary message are not
distributed independently and identically, symbols with lower occurrence frequencies carry more
information and vice versa. This enables lossless compression algorithms,
which reduce this redundancy by assigning new codes to the symbols according to the amount of
information they carry. If the symbols are distributed identically and carry no natural
redundancy, i.e., the alphabet includes all symbols from 0 to the maximum possible value with equal
frequencies, then the message is completely random and cannot be compressed.</p>
      <p>
        Data compression methods may utilize both dictionary and entropy coding [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or only one of
them [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. It is important that, if both methods are applied, entropy coding follows dictionary
coding, since entropy encoding algorithms tend to make the entropy of the data rise, so the data
become random and lose all the patterns that may be detected by dictionary coding [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
The purpose of this paper is to investigate how applying entropy encoding after
dictionary encoding affects the compression ratio and the time of compression and
decompression. The tasks to be solved are the implementation of the compressors and the testing of the
obtained algorithms with different data types that can be stored in regular or distributed file
systems. One of the materials for testing compression algorithms is the Silesia compression
corpus [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a collection of various binary, text, image, executable and other files. Some files
from this corpus are used for testing the different approaches in this paper.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this work four compression methods will be used for the research and experiments. Two of them
belong to entropy methods, and the other two to dictionary methods. To demonstrate the
algorithms we will use a simple text: “The cat sat on the mat, and the cat lied on that hat”. It
contains repetitions of symbols and consists of only 15 different symbols, a fairly
small alphabet compared to one containing all possible byte values.</p>
      <sec id="sec-3-1">
        <title>3.1 Huffman coding</title>
        <p>
          Huffman coding is a data compression method designed to ensure optimal encoding for each
separate symbol [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Developed based on Shannon-Fano method[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], it consists of three main
steps:
        </p>
        <p>1. During the first step the alphabet of the whole message is formed, and the frequencies of
all the symbols are calculated. These frequencies are then used to construct a Huffman tree, where
symbols with lower frequencies appear deeper in the tree, and those with higher frequencies are
closer to the root. Figure 1 depicts the Huffman tree built from the sample text mentioned before.</p>
        <p>2. After building the tree, the codes are assigned to their corresponding symbols using an
associative array or other table structure. This allows accessing each symbol’s code in O(1) time,
rather than looking it up in O(log2 n) time for each symbol in the message.</p>
        <p>By traversing the Huffman tree, we assign binary codes to each symbol. The tree is traversed
depth-first from the root recursively, each time adding a zero to the code when going to the left
child and a one when going to the right child. This means the codes are shorter for symbols that
occur more often and longer for those that occur less often. It also ensures that no code is a prefix of
another, allowing unambiguous determination of each symbol during decoding and making it more
efficient. The codes for each symbol of the sample text are given in Table 1.</p>
        <p>[Table 1: Huffman codes assigned to the symbols of the sample text; more frequent
symbols receive shorter codes.]</p>
        <p>3. Finally, the text is encoded using the stored codes. It is worth noting that Huffman and
other entropy encoding algorithms operate with bits, and usually the output bit sequences are
aligned so their count is a multiple of eight. The encoded example text results in 20 bytes
instead of the initial 52 bytes, which means the compression ratio is 2.6.</p>
        <p>To decode the message encoded with the Huffman algorithm, we need to build the tree used in
encoding and read the message bit by bit, traversing the obtained tree. When a symbol node is
found, the decoder writes the symbol into the output and moves the pointer to the root again. This
process is repeated until the last bit of the message is read.</p>
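        <p>As an illustration of steps 1 and 2, a minimal sketch of tree construction and code
assignment is given below. It uses a priority queue keyed by frequency; the Node record and the
BuildCodes helper are illustrative names, not the exact code of our implementation.</p>
        <p>// Simplified Huffman tree construction and depth-first code assignment.
sealed record Node(byte? Symbol, long Freq, Node? Left = null, Node? Right = null);

static Dictionary&lt;byte, string&gt; BuildCodes(IReadOnlyDictionary&lt;byte, long&gt; freqs)
{
    var queue = new PriorityQueue&lt;Node, long&gt;();
    foreach (var (symbol, freq) in freqs)
        queue.Enqueue(new Node(symbol, freq), freq);
    while (queue.Count &gt; 1) // merge the two rarest nodes until one root remains
    {
        var left = queue.Dequeue();
        var right = queue.Dequeue();
        var parent = new Node(null, left.Freq + right.Freq, left, right);
        queue.Enqueue(parent, parent.Freq);
    }
    var codes = new Dictionary&lt;byte, string&gt;();
    void Assign(Node node, string code) // 0 for the left child, 1 for the right child
    {
        if (node.Symbol is byte s) { codes[s] = code.Length &gt; 0 ? code : "0"; return; }
        Assign(node.Left!, code + "0");
        Assign(node.Right!, code + "1");
    }
    Assign(queue.Dequeue(), "");
    return codes;
}</p>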
        <p>
          The main bottleneck of Huffman compression is that it is computationally hard to make it
adaptive. This means the output must contain the data needed to build the same Huffman tree used
during compression. Some methods, such as the Vitter algorithm, allow adaptive Huffman
encoding [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. However, it is rarely used in practice since it requires much more computation for
node swapping [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Hence, each encoded block’s maximum storage overhead can be calculated
with (2).
        </p>
        <p>L + (max(byte) + 1) ⋅ (c + 1) = 1 + 256 ⋅ 5 = 1281 bytes, (2)
where L – number of bytes to store the length of the alphabet (here 1), max(byte) – maximum value of a byte
(255), c – number of bytes to store the number of appearances of a symbol (here 4).</p>
        <p>These calculations apply to the corner case when all 256 values of a byte are used in the
document. Still, using blocks of significant size, e.g., 100 KB or more, reduces the impact of the
header’s size.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Arithmetic coding</title>
        <p>
          Arithmetic encoding is another data compression method, which encodes messages using an
interval [0; 1) split into sub-intervals representing each symbol’s frequency. The result of
arithmetic coding is a number from 0 to 1 that represents the whole message, unlike Huffman
coding, which encodes separate symbols. With infinite precision, messages of any length could
potentially be encoded into single numbers [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>The process of non-adaptive arithmetic encoding, like that of Huffman encoding, starts with
collecting the information about the alphabet used in the message and the frequencies of symbols.
After this step the interval [0; 1) is split into sub-intervals with widths proportional to the relative
frequency of each symbol. Figure 2 shows the graphical representation of the interval division for
the sample text.</p>
        <p>After this we process each symbol, refining the interval. At first we take the whole interval
[0; 1) and select the sub-interval that represents the first symbol. For example, after
processing the letter “T” we move to [0.938; 1) and select it as the new interval. This new interval is
split again, and after processing the letter “h” we get [0.945; 0.948), and so on. After encoding every
symbol we reach the final interval [0.946063123456; 0.946063123478) and pick a number inside
these boundaries, for example 0.946063123460. This number encodes the whole sample
text and can be stored in only 4 bytes, which means the compression ratio is 13.</p>
        <p>To decode the message we need to know the intervals for the symbols. Using them we can find
the letters one by one, since the result number will always fall into the right intervals. For example,
the number mentioned above fits into [0.938; 1.0) interval, which stands for “T”, then into
[0.945; 0.948), which stands for “h”, and so on.</p>
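        <p>A minimal sketch of this interval refinement is shown below, using double precision,
which suffices only for very short messages (the actual implementation in Section 4 uses an
integer range coder); the cumulative table mapping each symbol to its [low; high) share is an
illustrative structure.</p>
        <p>// Non-adaptive arithmetic encoding: narrow [0; 1) symbol by symbol.
static double EncodeToNumber(string message, Dictionary&lt;char, (double Low, double High)&gt; cumulative)
{
    double low = 0.0, high = 1.0;
    foreach (var symbol in message)
    {
        var width = high - low;
        var (symLow, symHigh) = cumulative[symbol];
        high = low + width * symHigh; // shrink the interval to the symbol's share
        low = low + width * symLow;
    }
    return low; // any number in [low; high) decodes back to the message
}</p>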
        <p>Unlike Huffman encoding, arithmetic encoding is more amenable to an adaptive variant.
This makes it possible to avoid passing the data about symbols’ frequencies and does not require
significant complications. The adaptive algorithm has two major differences from the
algorithm described above:</p>
        <p>1. At first we assume that each symbol is present in the message's alphabet with equal
frequency. This means we initialize the interval with equal sub-intervals of length (3):</p>
        <p>1 / (max(byte) + 1) = 1/256 ≈ 0.0039, (3)
where max(byte) – the maximum value of a byte.</p>
        <p>2. With each encoded symbol, we make its sub-interval wider, assuming that with each
occurrence, the probability of this symbol appearing again in the same message gets bigger.
Accordingly, the frequencies of other symbols decrease proportionally to give a total of 1.</p>
        <p>These amendments allow a simple decoding process: build the same interval with
equal-probability sub-intervals and update them with each decoded symbol. Obviously, this doesn’t
require any data about the frequencies of symbols to be transferred with the encoded data.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 LZ4</title>
        <p>The LZ4 algorithm is an open-source data compression algorithm. It is designed to be fast, trading
compression ratio for CPU time. The LZ4 algorithm does not include entropy encoding such as
Huffman or arithmetic coding, and utilizes a dictionary approach similar to the LZ77 algorithm.</p>
        <p>An LZ4 block consists of sequences, which are suites of literals (uncompressed bytes)
followed by match copy operations. To encode a message, LZ4 scans input data for repeated
sequences and replaces them with references to previous occurrences. Literals are copied directly
without transformation, avoiding additional encoding overhead; their lengths are encoded with the 4
bits before them. Match copies follow the literals, with 2 bytes for the offset value and the length
of the match encoded with the 4 bits after the literal length. A schematic image of an LZ4 block is shown in
Figure 3.</p>
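        <p>A sketch of how the fields of one such sequence can be read is given below. It follows the
block format described above, including the extension bytes used when a 4-bit length field is
saturated at 15; the reader is illustrative and omits the special handling of the final sequence,
which has no match part.</p>
        <p>// Reading the header fields of one LZ4 sequence (illustrative).
static (int LiteralLength, int MatchLength, int Offset) ReadSequenceHeader(byte[] block, ref int pos)
{
    var token = block[pos++];
    var literalLength = ReadLength(block, ref pos, token &gt;&gt; 4);
    pos += literalLength; // skip over the literal bytes themselves
    int offset = block[pos] | (block[pos + 1] &lt;&lt; 8); // 2-byte little-endian offset
    pos += 2;
    var matchLength = ReadLength(block, ref pos, token &amp; 0x0F) + 4; // minimum match is 4
    return (literalLength, matchLength, offset);
}

// A 4-bit length of 15 means extension bytes follow; each byte of 255 adds more.
static int ReadLength(byte[] block, ref int pos, int nibble)
{
    var length = nibble;
    if (nibble != 15) return length;
    byte next;
    do { next = block[pos++]; length += next; } while (next == 255);
    return length;
}</p>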
        <p>
          To achieve fast compression and decompression – 577 MB/s and 3716 MB/s respectively for
lz4 1.10.0, according to lzbench [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] – LZ4 minimizes cache misses by avoiding branching, where
the simple block format is helpful. Also, this algorithm utilizes hash tables for storing previously
seen sequences of symbols, which enables looking up for matches in O(1) time. The decompression
algorithm avoids complex calculations, copying information from literals. Unlike LZ77, the match
copies do not allow cyclic copying from the literals.
        </p>
        <p>The encoding of the sample text “The cat sat on the mat, and the cat lied on that hat” would be
performed in these three steps:</p>
        <p>1. Finding the first literal: “The cat sat on the mat, and”, and encoding its length 27. This
length is encoded with 1 additional byte after token.</p>
        <p>2. Encoding the match copy after the literal, with its length in the 4 lower bits of the token:
“[space]the[space]” with length 5 and offset 13. The match length is stored as 1, i.e., the length minus the
minimum match length of 4.</p>
        <p>3. Encoding another literal: “cat lied on that hat” with length 20 with no match copies
afterwards.</p>
        <p>
          After the encoding process we receive 53 bytes, which means the encoding did not result in
compression. This is the result of the small size of the message and of some missed matches:
the words “cat”, “hat” and “on” with surrounding space characters have not been encoded as matches. The
reason is that the LZ4 block format has limitations: the minimum match length is 4, the last 5 bytes are
always literals, and the last match must start at least 12 bytes before the end of the block [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4 Snappy</title>
        <p>Snappy is another open-source compression algorithm utilizing dictionary coding. Like LZ4, it was
designed for fast encoding and decoding, so it does not include entropy coding and is
byte-oriented.</p>
        <p>
          The stream of data encoded with Snappy starts with a preamble, which contains the
uncompressed length stored as a varint. A varint is a type where each byte carries
seven payload bits, and the highest bit is set if there are more bytes to be read. After this preamble
come elements of four different types. Every element starts with a tag byte, where the two lower
bits indicate the type and the six higher bits the length of the element [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
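        <p>A minimal sketch of writing such a varint is given below (an illustrative helper; the
least significant seven-bit group is written first, matching the little-endian varint described
above).</p>
        <p>// Write a 32-bit value as a varint; returns the number of bytes produced.
static int WriteVarint(uint value, Span&lt;byte&gt; output)
{
    var written = 0;
    while (value &gt;= 0x80)
    {
        output[written++] = (byte)(value | 0x80); // low 7 bits, continuation bit set
        value &gt;&gt;= 7;
    }
    output[written++] = (byte)value; // final byte: high bit clear
    return written;
}</p>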
        <p>1. Literals (00). Uncompressed data, stored directly after the tag byte.</p>
        <p>2. Copies (01-11). References to previously decompressed data, including literals. They
consist of an offset, which sets the relative position back in the decoded stream, and a length, which
means how many symbols should be copied. Unlike LZ4, Snappy allows lengths bigger than offsets,
which means copying the symbols in cycles. The offsets for the different copy types are stored in one,
two or four bytes.</p>
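        <p>Such a cyclic copy falls out naturally if the decoder copies byte by byte, because the
destination can re-read bytes it has just produced; a minimal sketch of this behavior is shown
below (an illustrative helper, not Snappy’s actual decoder).</p>
        <p>// Copy that tolerates length &gt; offset: with offset 1 and length 10, one byte
// is repeated ten times, reproducing the pattern in cycles.
static void OverlappingCopy(List&lt;byte&gt; output, int offset, int length)
{
    var start = output.Count - offset;
    for (var i = 0; i &lt; length; i++)
        output.Add(output[start + i]);
}</p>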
        <p>The block format of Snappy is similar to that of LZ4 (Figure 4), though it has fewer limitations on match
length and the encoding of literals. It also works with 32 KB blocks, although the format
does not specify this directly.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>To investigate the prospects of applying entropy encoding algorithms after the dictionary
algorithms described above, we have implemented optimal Huffman coding and arithmetic coding
using C# 12 and .NET 8. We also developed a structure for handling bit storage and operations:</p>
      <p>public struct CountedBitsArray : ICloneable
{
    /// &lt;summary&gt;
    /// Count of bits in the last byte in the array. Bits are stored from the least
    /// significant position upwards within each byte.
    /// &lt;/summary&gt;
    private byte bitCount = 0;
    private List&lt;byte&gt; bytes = [];

    public CountedBitsArray()
    {
        bytes.Add(0);
    }

    public CountedBitsArray(Span&lt;byte&gt; storedBytes)
    {
        bitCount = storedBytes[0];
        bytes.AddRange(storedBytes[1..].ToArray()); // List&lt;T&gt; has no Span overload on .NET 8
    }
}</p>
      <p>This structure is kept simple to avoid redundant operations and branching. It has several
convenience methods for adding a bit, checking a bit at a specific position or appending another
CountedBitsArray.</p>
      <p>/// &lt;summary&gt;
/// Add bit to the last byte in &lt;paramref name="bytes"/&gt;
/// &lt;/summary&gt;
public void AddBit(bool value)
{
    if (bitCount == 8)
    {
        bytes.Add(0);
        bitCount = 0;
    }
    if (value) // no need to assign 0 since it is already there
    {
        var mask = (byte)(1 &lt;&lt; bitCount);
        bytes[^1] |= mask;
    }
    bitCount++;
}

/// &lt;summary&gt;
/// Get the value of a bit at the specified position, starting from the beginning of &lt;paramref name="bytes"/&gt;
/// &lt;/summary&gt;
public readonly bool IsBitSet(int pos)
{
    // get the required byte
    var byteNum = pos / 8;
    if (byteNum &gt;= bytes.Count) return false;
    // get the required bit in the byte
    var bitNum = pos % 8;
    // use a bit mask
    return (bytes[byteNum] &amp; (1 &lt;&lt; bitNum)) != 0;
}

public void AddBits(int count, bool value)
{
    for (int i = 0; i &lt; count; i++)
    {
        AddBit(value);
    }
}

public void AppendBitArray(CountedBitsArray bitsArrayToAppend)
{
    for (int i = 0; i &lt; bitsArrayToAppend.BitsCount; i++)
    {
        AddBit(bitsArrayToAppend.IsBitSet(i));
    }
}</p>
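      <p>A short usage example (illustrative): the structure accumulates bits LSB-first within
each byte and can be queried by position.</p>
      <p>var bits = new CountedBitsArray();
bits.AddBit(true);          // bit 0 = 1
bits.AddBit(false);         // bit 1 = 0
bits.AddBits(2, true);      // bits 2 and 3 = 1
Console.WriteLine(bits.IsBitSet(0)); // True
Console.WriteLine(bits.IsBitSet(1)); // False</p>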
      <p>The arithmetic encoder was implemented as an integer range coder, which works similarly but
allows the use of finite-precision data types. To implement such behavior, specific numbers that
play the roles of a “whole”, “half” and “quarter” are introduced, such that (4):</p>
      <p>half = whole / 2, quarter = whole / 4, R ⋅ whole ≤ max(int64), (4)
where R – needed level of precision, max(int64) – maximum value of a 64-bit integer.</p>
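      <p>One possible choice of constants satisfying (4) is sketched below (illustrative values; here
R is taken as the largest total frequency the model is allowed to reach).</p>
      <p>const long Whole = 1L &lt;&lt; 32;          // width of the “whole” interval
const long Half = Whole / 2;
const long Quarter = Whole / 4;
const long R = long.MaxValue / Whole; // largest safe precision level, about 2^31</p>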
      <p>This gives an opportunity to simplify the algorithm and use only integer values without the
need to perform costly division operations. Adaptive arithmetic coding becomes much faster and
easier to implement. Below, a part of the encoding function is presented:</p>
      <p>var symbolCount = Models[modelNum].Length;
var symbolFreq = new int[symbolCount];
symbolFreq.Populate(1); // every symbol's frequency is initialized as 1
var symbolFreqAdded = new int[symbolCount];
symbolFreqAdded.SumUp(symbolFreq, 0);
var totalFreq = symbolFreq.Sum();
var lowerBound = 0L;
var upperBound = Whole;
var splits = 0;
var result = new CountedBitsArray();
foreach (var s in data)
{
    EncodeSymbol(symbolFreq, symbolFreqAdded, ref totalFreq, ref lowerBound,
        ref upperBound, ref splits, ref result,
        Array.BinarySearch(Models[modelNum], s));
}
EncodeSymbol(symbolFreq, symbolFreqAdded, ref totalFreq, ref lowerBound,
    ref upperBound, ref splits, ref result, symbolCount - 1);
++splits;
if (lowerBound &lt;= Quarter)
{
    result.AddBit(false);
    result.AddBits(splits, true);
}
else
{
    // symmetric final branch (reconstructed; the original listing is truncated here)
    result.AddBit(true);
    result.AddBits(splits, false);
}</p>
      <p>The Snappy and LZ4 algorithms are used as “black boxes”: their default implementations are
adopted “as is” and remain unchanged. To launch and direct the algorithms, the default command
line interfaces (CLI) are used, if present. If a CLI is absent, a third-party CLI can be used; this
does not affect the compression or decompression and serves only for the convenience of utilizing the
algorithms.</p>
      <p>
        To test the algorithms, files from Silesia compression corpus were used [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Their descriptions
are presented in Table 2.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>First, the files were compressed with LZ4 and Snappy without further application of Huffman and
arithmetic coding, to measure their performance without any external influence. LZ4 was used with
different compression levels, which increase CPU time but also improve compression ratios.
The results of the compression are presented in Table 3.</p>
      <p>It’s worth noting that the LZ4 and Snappy CLIs don’t have built-in measurement of the time spent on
compression or decompression, so third-party time measurement was used. This means that the time
periods mentioned in Table 3 are approximate and serve only for observing the relative
changes.</p>
      <p>The files were also compressed using the implemented Huffman and arithmetic coding
algorithms with different block sizes. The results are given in Table 4.</p>
      <p>After these tests the files were compressed in a chain: first with LZ4 or Snappy and then with
Huffman or arithmetic coding. The mean compression ratios and time of encoding and decoding
each file are given in Tables 5-7.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussions</title>
      <p>From Tables 3-5 we can see that using entropy coding after dictionary coding increased
the mean compression ratios by 7 to 17% for different files. The compression
efficiency depends on the type of file: binary files without a structure, like “sao” in the test
corpus, are compressed with lower ratios than binary files that contain a structure, like “nci”,
which gave the highest compression ratios with both entropy and dictionary coders. The XML file,
named “xml”, was a good target for LZ4 and Snappy, but Huffman and arithmetic encoding gave
compression ratios almost as low as for the executable and plain text files (“mozilla” and “dickens”
respectively). The entropy encoding algorithms were slightly better with the image “mr”, but the
best file for entropy encoding was “nci”, too.</p>
      <p>The tests of the LZ4 implementations showed that in most cases using compression levels higher
than 4 was not justified, as it made compression slightly slower but did not give much higher
ratios. LZ4 usually worked slower than Snappy, but the compression ratios Snappy gave exceeded
only those of the first LZ4 level, and only on simple text files. However, using entropy coding after
Snappy in most cases raised the ratios more than using it after LZ4 of any level.</p>
      <p>In most cases arithmetic coding resulted in higher compression ratios than Huffman coding, but
the difference was rarely significant. The biggest difference between the ratios of these two
compression methods, 0.16, was observed during the compression of the “mr” file with a block size of
32 KB. Also, arithmetic coding appears to be much slower, which is observed best
with bigger binary files of higher entropy. The encoding speed of the arithmetic compressor is
also more sensitive to the block size, and using bigger blocks is more expedient with bigger
files.</p>
      <p>The decompression speed of LZ4 and Snappy depends mostly on the size of the file and does not
depend on the compression level or the type of the file. When dictionary and entropy coding were
combined, decompression speed was dominated by the entropy coding, which tends to be slower than
the dictionary one. Arithmetic decoding is also slower than Huffman decoding, as in the case
of encoding.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The dictionary encoders widely used in distributed platforms can give satisfactory compression
ratios on structured data while performing operations at high speed. The results obtained show that
using the relatively slow Huffman or arithmetic encoding after fast dictionary encoders can increase
compression ratios, but requires more time for compression and decompression. Combining
dictionary and entropy encoders helps mitigate the impact of the high entropy of binary files.</p>
      <p>
        The topic is valuable for the further development of various distributed platforms, databases and
other information storage structures. The results show that the combinations of algorithms are
good for compressing structured databases, and they can also be useful for improving
compression ratios under limited storage size or Internet traffic. In the
future, we plan to develop and compare other compression algorithms, including ones that use
machine learning: PPM, context mixing [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ] etc. The models from the implemented arithmetic
coding could be expanded and improved with the use of these techniques.
      </p>
      <p>The work has practical value for intelligent systems, as it provides a foundation for integrating
data compression strategies into the pipeline of intelligent data processing. Modern intelligent
systems, including recommender systems, distributed knowledge bases, and AI services, heavily
rely on the storage and transmission of large volumes of structured data. The paper proposes a
compression strategy that can optimize resource usage under bandwidth or memory constraints.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. M. S. Farooq, Z. Kalim, J. N. Qureshi, S. Rasheed, A. Abid, A Blockchain-Based Framework for Distributed Agile Software Development, IEEE Access 10 (2022) 17977-17995. doi:10.1109/ACCESS.2022.3146953</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. D. Boyd, K. Crawford, Six Provocations for Big Data, in: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011. doi:10.2139/ssrn.1926431</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. D. Podgorelec, D. Strnad, I. Kolingerová, B. Žalik, State-of-the-Art Trends in Data Compression: COMPROMISE Case Study, Entropy 26 (2024). doi:10.3390/e26121032</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. D. Kempa, T. Kociumaka, Lempel-Ziv (LZ77) Factorization in Sublinear Time, in: 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), Chicago, IL, USA, 2024, pp. 2045-2055. doi:10.1109/FOCS61266.2024.00122</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. J. A. Bergstra, J. V. Tucker, On Defining Expressions for Entropy and Cross-Entropy: The Entropic Transreals and Their Fracterm Calculus, Entropy 27 (2025). doi:10.3390/e27010031</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. H. Alizadeh Noughabi, M. Shafaei Noughabi, A New Estimator for Shannon Entropy, Statistics, Optimization &amp; Information Computing 13 (2025) 891-899. doi:10.19139/soic-2310-5070-1844</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. A. Somazzi, P. Ferragina, D. Garlaschelli, On Nonlinear Compression Costs: When Shannon Meets Rényi, IEEE Access 12 (2024) 77750-77763. doi:10.1109/ACCESS.2024.3406912</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. H. K. Tayyeh, A. S. Ahmed AL-Jumaili, A combination of least significant bit and deflate compression for image steganography, International Journal of Electrical and Computer Engineering (IJECE) 12 (2022) 358-364. doi:10.11591/ijece.v12i1.pp358-364</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Y. Collet, LZ4 Block Format Description, 2022. URL: https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. V. Costan, Snappy Format Description, 2011. URL: https://github.com/google/snappy/blob/main/format_description.txt</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. S. T. Klein, D. Shapira, On the Randomness of Compressed Data, Information 11 (2020) 196. doi:10.3390/info11040196</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. S. Deorowicz, Silesia compression corpus, 2003. URL: https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. D. Huffman, A Method for the Construction of Minimum Redundancy Codes, Proceedings of the IRE 40 (9) (1952) 1098-1101. doi:10.1109/JRPROC.1952.273898</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. S. Congero, K. Zeger, Competitive Advantage of Huffman and Shannon-Fano Codes, IEEE Transactions on Information Theory (2024). doi:10.1109/TIT.2024.3417010</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. J. S. Vitter, Design and analysis of dynamic Huffman codes, Journal of the ACM 34 (1987) 825-845. doi:10.1145/31846.42227</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. S. Song, T. Lian, W. Liu, Z. Zhang, M. Luo, A. Wu, A lossless compression method for logging data while drilling, Systems Science &amp; Control Engineering 9 (1) (2021) 689-703. doi:10.1080/21642583.2021.1981478</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. F. Auli-Llinas, Fast and Efficient Entropy Coding Architectures for Massive Data Compression, Technologies 11 (2023) 132. doi:10.3390/technologies11050132</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. P. Skibinski, lzbench, 2025. URL: https://github.com/inikep/lzbench?tab=readme-ov-file#benchmarks</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. R. M. K. Mohideen, P. Peter, J. Weickert, A systematic evaluation of coding strategies for sparse binary images, Signal Processing: Image Communication 99 (2021) 116424. doi:10.1016/j.image.2021.116424</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. L .</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>