Reverse Multi-Delimiter Codes in English and Ukrainian Natural Language Text Compression

Igor Zavadskyi a and Viktoriia Zavadska b

a Taras Shevchenko National University of Kyiv, 4d Glushkova ave, 03022, Kyiv, Ukraine
b National Technical University of Ukraine 'Igor Sikorsky Kyiv Polytechnic Institute', 37 Prosp. Peremohy, 03056, Kyiv, Ukraine

Abstract
A new "reverse" version of multi-delimiter data compression codes is presented, and a fast decoding algorithm for these codes is discussed. Experiments comparing the compression ratio and decoding time of different codes applied to Ukrainian and English natural language texts are provided. The optimal parameters of reverse multi-delimiter codes for Ukrainian natural language text compression are identified.

Keywords
Natural language, Ukrainian, Compression, Code, Multi-Delimiter

Information Technology and Implementation (IT&I-2021), December 01–03, 2021, Kyiv, Ukraine
EMAIL: ihorzavadskyi@knu.ua (A. 1); ptits@ukr.net (A. 2)
ORCID: 0000-0002-4826-5265 (A. 1); 0000-0002-6411-4926 (A. 2)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Natural language text compression is one of the key components of modern information retrieval systems. From the perspective of the input alphabet structure, all compression methods in this field can be divided into two groups. The first consists of methods operating on individual characters as alphabet elements, while methods of the second group rely on words of a text as atomic units. Generally, the latter methods provide a better compression ratio, and we focus our attention on them. This approach implies using a dictionary, i.e., some mapping between the set of words of a text and the set of codewords. In terms of compression ratio, it is not important how this mapping is implemented technically; the only requirement is that shorter codewords correspond to more frequent words of a text.

Several known codes provide a compression ratio close to the theoretical limit defined by Shannon's entropy. This refers to arithmetic encoding, codes based on asymmetric numeral systems [4], and, to some extent, Huffman codes [5]. However, apart from the compression ratio, a number of other requirements should be taken into account, for which the above-mentioned codes are not well-suited.

• Encoding/decoding speed. Decoding speed is the more important of the two, since decoding is often performed "online", many times, for a text encoded "offline" only once.
• Synchronizability. This means that a possible error in the codeword sequence has a limited area of propagation.
• Compressed search. The possibility of searching a compressed file without decompressing it can significantly improve the performance of information retrieval systems.

In order to provide fast decoding, a data structure with low access time should be used to store the dictionary. For this purpose, the most efficient data structure is the array with integer indices. Then, at each iteration of the decoding loop, we need to extract a codeword from the coded bitstream, convert this codeword to a number, and output the corresponding dictionary element. The mentioned extraction and conversion can be done in parallel, and thus the decoding algorithm, in fact, consists in mapping a bitstream of codewords to a sequence of integers. Since a bitstream is divided into bytes in computer memory, and the processor operates on bytes or even machine words, it is reasonable to formulate a decoding algorithm in terms of byte-level or machine-word-level operations.
There are two ways to achieve this goal. The first consists in constructing codewords from a whole number of bytes only. Such codes are called byte-aligned. The most advanced representatives of this family are (s,c)-dense codes (SCDC) [2] and codes with restricted prefix properties (RPBC) [3]. They make ultra-fast decoding possible, however, at the cost of compression ratio, which becomes 14–18% above the entropy when English texts are compressed. The other way is to use lookup tables, which allow us to decode whole bytes or other blocks of bits at one iteration of a decoding loop. This approach is implemented for Fibonacci codes in [6] and for the more recently invented multi-delimiter (MD) codes in [1]. These codes provide a better compression ratio, 2–5% (multi-delimiter codes) or 4–7% (Fibonacci codes) above the entropy. And although the proposed fast methods of their decoding are 25–100% slower than those for byte-aligned codes, in general, MD-codes occupy an attractive point in the tradeoff between compression ratio, decoding speed, and other useful code properties.

Nonetheless, for MD-codes, one important problem has not been clarified. It relates to efficient dictionary indexing. Namely, the encoding mapping ψ described in [1] has one quite essential disadvantage: the codewords ψ_1, ψ_2, … are not sorted in ascending order of their lengths. It follows that the codewords ψ_1, …, ψ_n do not constitute the set of n shortest codewords. However, in order to gain the maximum compression ratio, one should use the n shortest codewords ψ_{i_1}, …, ψ_{i_n}, where i_n could be much greater than n. Thus, an array having non-empty elements with indices i_1, …, i_n could be very sparse and impractical even for coding relatively short texts containing 2000–3000 different words. Although some workarounds to resolve this issue were presented in [7], the best option is to construct a monotonic encoding mapping ψ (i.e., if i_1 > i_2, then ψ_{i_1} is not shorter than ψ_{i_2}). However, for MD-codes it seems to be problematic to build the inverse mapping based on processing bits of a codeword from left to right, since processing a part of a codeword does not give much information about the position of the whole codeword in the dictionary.

One can assume that this mapping can be constructed for a non-prefix code, if shorter codewords are prefixes of longer ones. Consider uv, the concatenation of a codeword u with some suffix v. To decode it, we can proceed as follows:

1. in the ordered set of all codewords, obtain the index of u;
2. increase this index by some value which depends on v and can be calculated by a special algorithm.

We call this approach "decoding in parts". In particular, it allows us to process codewords that occupy several bytes in a byte-by-byte manner.
Note that if we write the bits of codewords of a multi-delimiter code in reverse order, we get a non-prefix but still uniquely decodable code. Indeed, an encoded file can be decoded from right to left as a multi-delimiter code. Also, it can be uniquely decoded from left to right, since in this code all delimiters remain the same as in the source multi-delimiter code; however, they are placed at the beginnings of codewords instead of their ends. Such "reverse" multi-delimiter codes are the main subject of this presentation. For multi-delimiter codes modified in this way, the most remarkable property we discover is the existence of a monotonic encoding mapping from the set of natural numbers to the set of codewords, for which a simple inverse decoding mapping based on processing bits from left to right can be easily constructed.

In this paper, we give the definition of the reverse multi-delimiter codes (Section 2) and describe a fast method of their decoding implementing the "decoding in parts" principle (Section 3). In Section 4 we discuss the results of experiments measuring the compression ratio and decoding speed of different reverse multi-delimiter, Fibonacci, and byte-aligned codes applied to compression of English and Ukrainian texts. Also, we discuss the specificities of and differences between English and Ukrainian natural language text compression. In Section 5 we draw some final conclusions.

2. Codeword Set

Let M = {m_1, ..., m_t} be a set of integers given in ascending order, 0 < m_1 < ... < m_t. The reverse multi-delimiter (RMD) code R_{m_1,...,m_t} consists of all the words of the form 1^{m_i} 0, i = 1, ..., t, and all other words that meet the following requirements:

1. a word starts with the prefix 01^{m_i} 0 for some m_i ∈ M;
2. for any m_i ∈ M, a word does not end with the sequence 01^{m_i};
3. for any m_i ∈ M, a word can contain the sequence 01^{m_i} 0 only as a prefix.

This implies that the delimiters in R_{m_1,...,m_t} are the sequences of the form 01^{m_i} 0. However, the code also contains shorter words of the form 01^{m_i} that constitute delimiters when appended by the leading zero of the next codeword.

Now let us construct a monotonic encoding mapping that maps the set of natural numbers to the set of codewords of the RMD-code R_{m_1,...,m_t}. Let K = {k_1, ..., k_q} be the sequence of integers in the range [0; m_t + 1], given in ascending order, that do not belong to M. For instance, K = {0, 1, 3, 6} for the code R2,4,5. Note that each codeword of R_{m_1,...,m_t} has the following structure: it consists of a prefix 01^{m_i}, for some m_i ∈ M, which can be followed by some groups of bits having the form 01^s, where s ∈ K or s > k_q. The inverse statement is also correct: a bit sequence having the described structure constitutes a codeword. Therefore, in the code R_{m_1,...,m_t}, any codeword of length L can be constructed in one (and only one) of the following ways:

1. it is a word of the form 01^{m_i}, i = 1, ..., t;
2. it is composed of a codeword of length L−s−1 followed by the sequence 01^s, s ∈ K;
3. it is composed of a codeword of length L−1 with a suffix 01^r, r > m_t, appended by a single '1' bit.

Following these facts, we formulate a principle of constructing the ordered set of codewords of length L, assuming the sets of shorter codewords have already been built:

1. For i from 1 to q, replicate the sets of codewords of lengths L−k_i−1 and append the sequences of the form 01^{k_i}, k_i ∈ K, to them.
2. Replicate the codewords of length L−1 with a suffix 01^r, r > m_t, and append a single '1' bit to all elements of this set.
3. If L = m_i + 1, i ∈ {1, ..., t}, append the word 01^{m_i} to the codeword set.

If we apply this approach gradually to form the sets of codewords of lengths m_1+1, m_1+2, …, we obtain the whole set of codewords ordered by their lengths. For example, the set of 8-bit or shorter codewords belonging to the code R2,4,5 is given in Fig. 1. It also demonstrates how the words of length 8 are constructed from the words of lengths 7, 6 and 4 by appending the bit sequences 0, 01, and 0111 respectively (rule 1). Three codewords highlighted in grey are constructed by applying rule 3. The shortest codeword that is constructed by applying rule 2 has length 11; its construction is shown in the bottom-left part of the figure.

Figure 1: Construction of the codeword set of the code R2,4,5
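To make the construction concrete, the following C sketch enumerates the ordered codeword set of R2,4,5 up to length 8 by a direct application of the three rules above (rule 2 first fires only at length 11, so for this bound it is left as a comment). The program and its identifiers are our illustration, not part of the paper's implementation; it should reproduce the set shown in Fig. 1.

#include <stdio.h>
#include <string.h>

#define MAXLEN 8
#define MAXWORDS 64

static const int M[3] = {2, 4, 5};    /* delimiter parameters m_1..m_t */
static const int K[4] = {0, 1, 3, 6}; /* integers in [0; m_t+1] not in M */
static char words[MAXLEN + 1][MAXWORDS][MAXLEN + 2]; /* words[L][j] */
static int count[MAXLEN + 1];

int main(void) {
    int index = 0; /* position of a codeword in the whole ordered set */
    for (int L = 3; L <= MAXLEN; L++) {
        /* Rule 1: replicate the words of length L-k-1 and append 0 1^k. */
        for (int i = 0; i < 4; i++) {
            int src = L - K[i] - 1;
            if (src < 3) continue; /* no codeword is shorter than 3 bits */
            for (int j = 0; j < count[src]; j++) {
                char *w = words[L][count[L]++];
                strcpy(w, words[src][j]);
                strcat(w, "0");
                for (int b = 0; b < K[i]; b++) strcat(w, "1");
            }
        }
        /* Rule 2: replicate the words of length L-1 ending with 0 1^r,
           r > 5, and append '1'. The shortest such source word is 10 bits
           long, so this rule never fires while L <= 8. */
        /* Rule 3: if L = m_i + 1, append the word 0 1^(L-1). */
        for (int i = 0; i < 3; i++) {
            if (L == M[i] + 1) {
                char *w = words[L][count[L]++];
                w[0] = '0';
                for (int b = 1; b < L; b++) w[b] = '1';
                w[L] = '\0';
            }
        }
        for (int j = 0; j < count[L]; j++)
            printf("c%d = %s\n", ++index, words[L][j]);
    }
    return 0;
}

Running the sketch yields c1 = 011, c2 = 0110, c3 = 01100, c4 = 01101, c5 = 01111, and so on: the codewords come out ordered by length, as required for a monotonic mapping.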
Let us note that the reverse multi-delimiter codes possess all the properties of MD-codes, such as unique decodability, completeness, and universality, as well as their asymptotic densities, because R_{m_1,...,m_t} contains the same number of codewords of a given length as the "direct" multi-delimiter code D_{m_1,...,m_t}. For the original version of multi-delimiter codes, these properties were proven in [1]. Completeness means that no codeword can be added to the codeword set in such a way that the code remains uniquely decodable. And a code is universal in the sense of Elias [8] if its average codeword length is no more than a constant times longer than the average codeword length of the optimal code, for any distribution of source alphabet characters.

The numbers of short codewords of some reverse multi-delimiter codes and, for comparison, of the Fibonacci code Fib3 are given in Table 1 (in natural language text compression, the code Fib3 is considered the most efficient among Fibonacci codes). As seen, the RMD-codes can contain a considerably larger number of short codewords than the Fibonacci code Fib3. Of course, this superiority is compensated by the number of long codewords. However, the number of short codewords is rather more important when it comes to compression of real-world textual databases.

Table 1
Number of codewords of length ≤ n in codes with delimiters

Code\n     3    4    5    6    8    10    15     20
Fib3       1    2    4    8   28    96  2031  42762
R2–∞       1    3    7   14   46   133  1581  17690
R3–∞       0    1    3    7   30   110  2413  50941
R2,4–∞     1    2    5   10   37   122  2113  35283

3. Encoding and Decoding

Once the set of codewords c1, c2, … ordered by their lengths is built, the encoding of a text becomes trivial. It only requires ordering the words of the text to be encoded by descending frequency. The word w_i in this ordered set gets the codeword c_i.
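The following toy C example illustrates this encoding scheme. The five-word dictionary and the text are hypothetical; the codewords are the first five elements of the ordered set of R2,4,5 built in Section 2.

#include <stdio.h>

/* The beginning of the ordered codeword set of R2,4,5: c1..c5. */
static const char *codeword[5] = {"011", "0110", "01100", "01101", "01111"};

int main(void) {
    /* A hypothetical dictionary, most frequent word first. */
    const char *dict[5] = {"the", "of", "and", "to", "in"};
    int text[5] = {0, 4, 2, 0, 3}; /* "the in and the to" as word ranks */
    for (int i = 0; i < 5; i++)    /* emit the codeword of each word */
        printf("%s -> %s\n", dict[text[i]], codeword[text[i]]);
    return 0;
}

The i-th most frequent word simply receives the i-th shortest codeword, so the dictionary can be stored as a dense array indexed by codeword position, which is exactly what the monotonic mapping makes possible.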
The decoding is rather more sophisticated. In fact, it consists in recognizing a codeword c_i in the flow of codewords and calculating its index i in the set of codewords built by the principle given in the previous section. Note that the code R_{m_1,...,m_t} is a regular language over the binary alphabet. Thus, it can be recognized by a finite automaton, which processes the bits of a codeword from left to right. We can apply this automaton to calculate the index of a codeword in the ordered set of codewords.

In the sequel, we concentrate on the "infinity" versions of RMD-codes, e.g. R2–∞ or R2,4–∞, as they demonstrate the best compression ratio. The "infinity" code R2–∞ is built on all delimiters containing 2, 3, … ones. However, in practice it is enough to limit the length of delimiters by some relatively big number, for example 21, since the difference in compression ratio between R2,21 and R2,22 is negligibly small.

The decoding automaton for the code R2–∞ is shown in Fig. 2. It processes the codewords in the flow, starting in the state marked by the filled circle and finishing in the state 1 after processing a delimiter in the beginning of some codeword. Thus, for the decoding to be successfully completed, an encoded file must be appended with some delimiter, e.g. 01110. The automaton recognizes codewords bit by bit and calculates the values of two variables: L, the length of a codeword, and p, the resultant index of a codeword. On each transition, an input bit is indicated before the vertical bar, and the operations on L and p are indicated after it; the order of operations is important.

Figure 2: The decoding automaton for R2–∞

Let us explain how the automaton works. Any codeword of R2–∞ can be represented as a concatenation of bit sequences of the form 01^t, where t ≥ 0. After processing any such sequence and the '0' after it, the automaton comes to the state 0. If t ≥ 2, the bit sequence 01^t can only be the prefix of a codeword. Thus, after processing the last 1 in the sequence 011, the automaton comes to the state 2 from the state 1, outputs the resultant index p of the previous codeword, and initializes L with a new value 3 = |011|. On all other transitions, except for the initial one, L is incremented by 1. When the automaton comes from the state 2 to the state 0, this means that the prefix of a codeword of the form 01^{L−1} 0 has been processed, and p should be initialized with the index of the codeword 01^{L−1} in the codeword set, counted from 0. As seen in Fig. 1, this index is equal to n_L − 1, where n_L is the number of codewords of length not greater than L.

After processing the bit sequences 00 or 010, the automaton makes transitions to the state 0 from the states 0 and 1 respectively and adds to the codeword position p the value n_{L,0} or n_{L,1} respectively. Here n_{L,i} is equal to the shift of a codeword in the list of all codewords due to appending the bit sequence of the form 01^i, 0 ≤ i ≤ 1, when the length of the resultant codeword is L. Note that the aforementioned principle of constructing the codeword list guarantees that appending the same group 01…1 to any codeword of a given length shifts this codeword by the same distance. The values n_{L,i} and n_L can be calculated in the preprocessing stage together with the ordered codeword set.
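To illustrate these update rules, here is a minimal C reconstruction of the bit-level decoding of R2–∞. The state numbering is ours and may differ from Fig. 2, and the tables n_cum (n_L) and shift0/shift1 (n_{L,0}, n_{L,1}) are hardcoded for codewords up to 6 bits long, whereas a real decoder would precompute them together with the ordered codeword set.

#include <stdio.h>

static const int n_cum[7]  = {0, 0, 0, 1, 3, 7, 14}; /* n_L for L = 0..6 */
static const int shift0[7] = {0, 0, 0, 0, 1, 2, 4};  /* n_{L,0}          */
static const int shift1[7] = {0, 0, 0, 0, 0, 5, 10}; /* n_{L,1}          */

int main(void) {
    /* The codewords 011, 0110, 01101, appended with the delimiter 01110. */
    const char *bits = "01101100110101110";
    int state = 0, L = 1, p = 0, first = 1;
    for (int i = 1; bits[i]; i++) { /* bits[0] must be '0': consumed at start */
        int b = bits[i] - '0';
        if (state == 0) {           /* a tentative group 0 1^t, t = 0 so far */
            if (b == 0) { p += shift0[L]; L++; } /* the group was 0 */
            else { L++; state = 1; }
        } else if (state == 1) {    /* the tentative group is 01 so far */
            if (b == 0) { p += shift1[L]; L++; state = 0; } /* group was 01 */
            else {                  /* 011 seen: a new codeword begins */
                if (first) first = 0; else printf("%d\n", p);
                L = 3; state = 2;
            }
        } else {                    /* state 2: inside a prefix 0 1^m */
            if (b == 1) L++;
            else { p = n_cum[L] - 1; L++; state = 0; }
        }
    }
    return 0; /* prints 0, 1, 5: the indices of 011, 0110 and 01101 */
}

The leading zero of each codeword is consumed tentatively, and the accumulated index p of the finished codeword is emitted only when the next delimiter 011 is recognized; the sample input never drives a table access beyond L = 6, so the hardcoded tables suffice. The cumulative counts 1, 3, 7, 14 agree with the R2–∞ row of Table 1.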
For the different tested texts (see Section 4), the codes R3–∞ and R2,4–∞, apart from R2–∞, also provide the best compression ratio. Their decoding automata are shown in Fig. 3 and Fig. 4 respectively. The automaton for R3–∞ is quite similar to the one for R2–∞, although it has the extra state 3. The automaton for R2,4–∞ also recognizes the sequence 0110 as the beginning of a codeword. In this case it makes the transition from the state 2 to the state 0, outputs the result p of decoding the previous codeword, assigns 4 to L, since 4 bits of the new codeword have been processed, and 0 to p, because the codeword 011 occupies position 0 in the ordered codeword set.

Figure 3: The decoding automaton for R3–∞

Figure 4: The decoding automaton for R2,4–∞

Applying the decoding automaton to process an encoded text bit by bit could be quite slow. However, we can perform a byte "quantification" of the automaton. It is done by iteratively reading a fixed integral number of bytes and producing the corresponding output numbers. The principle of quantification is the same as described in [1]. The quantified automaton is given as a lookup table TAB[a_start][L_start][x] consisting of tuples (p, a_fin, L_fin), where x is a portion of bytes of the encoded file, a_start is the state of the automaton before processing x, a_fin is the state it comes to after processing x, L_start is the length of the already decoded part of the last codeword, which is under processing when the decoding of x starts, L_fin is the length of the decoded part of the last codeword generated from x, and p is the output generated by the automaton while processing x.

The byte-aligned decoding algorithm for any code with the shortest delimiter 0110 is given below. Its lookup table is built from the automata of Fig. 2 – Fig. 4, quantified to read whole bytes of the encoded text. To speed up the decoding, we use a series of one-dimensional lookup tables TABj[x] instead of one three-dimensional table. The number j of the lookup table depends on the parameters a_start and L_start; e.g., it can be obtained as 5·L_start + a_start (since 5 is the number of automaton states). Also, the tuples in the lookup table contain only the number j_fin instead of the pair (a_fin, L_fin). The output is represented by the numbers p1, p2, p3, p4 – the indices of decoded codewords, or of their decoded parts, in the array of all codewords. We have 4 values here, since no more than 4 codewords can be decoded, fully or partially, while processing one byte of a code whose shortest delimiter is 0110. Indeed, the only case when parts of 4 codewords are included in an 8-bit sequence is x 011 011 0: this byte includes the two shortest codewords 011, prepended with the ending bit x of some codeword and appended with the first bit 0 of another codeword. Also, note that the value p1 can represent the shift of a codeword uv in the list of codewords relative to its already decoded prefix u. Let us note that this algorithm is suitable for decoding any reverse multi-delimiter code whose shortest delimiter contains two ones, i.e. 0110.

Input: Encoded text
Output: The indices of decoded words in the array of codewords

i ← 0; j ← 0; p ← 0
while i < length of encoded text
    (p1, p2, p3, p4, j) ← TABj[Text[i]]
    p ← p + p1
    if p2 ≥ 0
        output(p)
        if p3 ≥ 0
            output(p2)
            if p4 ≥ 0
                output(p3)
                p ← p4
            else
                p ← p3
        else
            p ← p2
    i ← i + 1

Algorithm 1: The byte-aligned decoding algorithm for R2,x

It is easy to calculate the size of the lookup table of the presented byte-aligned decoding algorithm. For the larger English text in our experiments, containing 288K different words, the maximum codeword length L is 34. Considering the number of automaton states and the space needed to store the values p1–p4 and j, we get approximately 300–500K of memory needed to store all the lookup arrays TABj, which is quite acceptable for modern computers. However, the mentioned lookup tables are too large to fit into the L1 cache, and maybe even the L2 cache, which slows the algorithm down. To eliminate this problem, we developed a method of packing the value TABj[Text[i]] into a single 32-bit machine word. Apart from reducing the size of the lookup tables, it allows us to load all needed values into a processor register via one memory read at each iteration of the decoding loop, requiring, however, some extra bit-level operations to extract them. In total, this method can accelerate the decoding by 15–25%. We do not explain it in detail, since it is based on many low-level specifics that do not affect Algorithm 1 itself. Nevertheless, this approach was implemented in a program, and the results of its testing are shown in the next section.
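For concreteness, the decoding loop of Algorithm 1 can be written in C roughly as follows. The Entry type, the table layout, and all names are our illustration: the tables must be filled in a preprocessing stage by quantifying the automaton, and the actual implementation tested below packs each entry into a 32-bit word instead.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    int32_t p1, p2, p3, p4; /* codeword indices/shifts; negative = absent */
    int32_t jfin;           /* next table number, encoding (a_fin, L_fin) */
} Entry;

/* text: encoded bytes; TAB: one 256-entry table per j = 5*L_start + a_start;
   output: callback receiving the index of each decoded codeword. */
void decode(const uint8_t *text, size_t len,
            const Entry *const *TAB, void (*output)(int32_t)) {
    int32_t j = 0, p = 0;
    for (size_t i = 0; i < len; i++) {
        Entry e = TAB[j][text[i]];
        p += e.p1;                /* extend the partially decoded codeword */
        if (e.p2 >= 0) {
            output(p);            /* first codeword of this byte completed */
            if (e.p3 >= 0) {
                output(e.p2);
                if (e.p4 >= 0) {
                    output(e.p3); /* at most 4 codewords touch one byte */
                    p = e.p4;
                } else
                    p = e.p3;
            } else
                p = e.p2;
        }
        j = e.jfin;               /* switch to the table of the new state */
    }
}

Note that p is carried across iterations: it accumulates the index of a codeword that straddles a byte boundary and is emitted only when one of the subsequent bytes completes it.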
4. Experimental Results

The discussed data compression codes were tested on two English and two Ukrainian texts of different sizes. The smaller ones are the Bible: the King James version in English and the Ukrainian translation by I. Ohienko. The larger Ukrainian text consists of 27 novels by different 19th–20th century writers, 15.5MB in size, while the larger English text contains a 100MB collection of articles randomly taken from Wikipedia. The larger English text was preformatted by eliminating punctuation marks; upper- and lowercase letters are distinguished. The Ukrainian Bible was preformatted by eliminating the verse numbers. The two other texts were encoded without preformatting. Let us note that although the larger Ukrainian and English texts differ in size, they contain close numbers of different words and thus are comparable from the data compression point of view.

The compression efficiency of different codes applied to this corpus is shown in Table 2, together with the parameters of the texts. The compressed text size is given in bits per word, with the exceedance over the entropy shown as a percentage. For each text, the best result is marked with an asterisk. The parameters of the SCDC and RPBC codes were chosen for each text to maximize the compression ratio: RPBC(129,125,1,0) and SCDC(172) for the Ukrainian Bible; RPBC(191,63,1,0) and SCDC(197) for the English Bible; RPBC(110,140,5,0) and SCDC(165) for the Ukrainian fiction corpus; RPBC(153,97,5,0) and SCDC(175) for the English Wikipedia articles.

Table 2
Codes properties and compression ratio

                     Ukrainian Bible   Ukrainian         English Bible,    English formatted
                                       fiction corpus    King James ver.   Wikipedia texts
Entropy              11.986            13.436            9.48              11.078
Total words (TW)     586 254           1 428 193         766 112           19 507 784
Different words (DW) 74 996            258 215           28 659            288 179
TW/DW                7.817             5.531             26.732            67.693
Fib2                 13.373 (11.6%)    15.267 (13.6%)    9.993 (5.4%)      11.983 (8.2%)
Fib3                 12.561 (4.8%)     14.061 (4.7%)     9.844 (3.8%)      11.564 (4.4%)
SCDC                 13.858 (15.6%)    15.411 (14.7%)    11.024 (16.3%)    12.869 (16.2%)
RPBC                 13.522 (12.8%)    15.035 (11.9%)    10.953 (15.5%)    12.685 (14.5%)
R2–∞                 12.81 (6.9%)      14.661 (9.1%)     9.711 (2.4%)*     11.598 (4.7%)
R2,4–∞               12.506 (4.3%)*    14.117 (5.1%)     9.749 (2.8%)      11.393 (2.8%)*
R2,3,5–∞             12.548 (4.7%)     14.29 (6.4%)      9.989 (5.4%)      11.446 (3.3%)
R3–∞                 12.539 (4.6%)     14.013 (4.3%)*    9.989 (5.4%)      11.514 (3.9%)
R3,5–∞               12.791 (6.7%)     14.221 (5.8%)     10.313 (8.8%)     11.782 (6.4%)
R4–∞                 13.206 (10.2%)    14.584 (8.5%)     10.809 (14%)      12.233 (10.4%)

As seen, among the investigated codes, different representatives of the RMD-family show the best compression ratio for all texts in both languages. This is confirmed by the graph in Fig. 5, illustrating the encoding of the Ukrainian and English Bibles. Both the text and the codeword set distributions are shown. Positions of words ordered by decreasing frequency are plotted along the x-axis, differing by a factor of 1.2. The y-axis shows the codeword lengths for the codeword sets and the values −log p_i for the texts (p_i is the probability of encountering the i-th word in the text). A theoretically best codeword set should reach the entropy limit H = −Σ_i p_i log p_i, i.e., the length of the i-th codeword (in the order of increasing lengths) has to be equal to −log p_i. Thus, the closer a codeword set curve is to a text curve, the better the compression ratio the code provides. As seen, both the R2–∞ and R2,4–∞ curves align with the English text curve (blue) better than with the Ukrainian one (orange), although R2,4–∞ is a bit closer to the Ukrainian curve than R2–∞. In fact, all codes, except for the byte-aligned ones, compress English texts better than Ukrainian ones, since the Ukrainian distribution curve is too flat (that is to say, words in a Ukrainian text are distributed more uniformly). At the same time, the 'ladder-shaped' distribution curve of the byte-aligned RPBC codes is not as far from the flatter Ukrainian word distribution as it is from the steeper English one. Nonetheless, the compression ratio of the byte-aligned codes, in comparison with the other codes, remains the worst for Ukrainian texts, although almost on a level with the Fibonacci code Fib2 and the reverse multi-delimiter code R4–∞. Generally, for Ukrainian texts the dispersion of the deviations of the compressed text size from the entropy is smaller, while the mean deviation is bigger than for English. This means that some of the codes in research are well suited for English text compression and some are not, while no code can be considered very well suited for Ukrainian natural language texts. However, due to the existence of many customizable parameters, the reverse multi-delimiter codes can be aligned with the Ukrainian natural language word distribution more tightly than the other codes of the discussed class.

Figure 5: The distribution of text word frequencies and codeword lengths
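The entropy benchmark used throughout Table 2 and Fig. 5 is the word-level Shannon entropy. A minimal C sketch of its computation (with hypothetical word counts) follows; dividing the compressed size in bits by the total number of words and comparing against H gives the percentages quoted in Table 2.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* hypothetical occurrence counts of the distinct words of a text */
    double count[5] = {500, 300, 120, 50, 30};
    double total = 0.0, H = 0.0;
    for (int i = 0; i < 5; i++) total += count[i];
    for (int i = 0; i < 5; i++) {
        double p = count[i] / total; /* probability of the i-th word */
        H -= p * log2(p);            /* H = -sum p_i * log2 p_i      */
    }
    printf("entropy: %.3f bits per word\n", H);
    return 0;
}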
The average decoding time is given in Table 3. The decoding algorithm was implemented in the C programming language and compiled with the gcc compiler, with -O3 time optimization enabled. The results of decompression were stored in RAM as arrays of decoded numbers. This approach is more indicative than full decoding with restoration of the original text, since converting numbers to strings is quite time-consuming and neutralizes the difference in performance between the methods. The given values represent the average time of 100 runs of each algorithm on a PC with an AMD Athlon II X2 245 2.9GHz processor, 64K L1 cache, 1MB L2 cache, and 4GB RAM, running OS Windows 10.

For each text, to measure the decoding time, we chose the RMD-code maximizing the compression ratio of that text, according to Table 2. As seen, the decoding of the RMD-codes appears to be 40–80% slower than RPBC decoding, 20–30% slower than SCDC decoding, and almost 2 times faster than the decoding of the Fibonacci code Fib3. It is worth noting that the RPBC codes, which provide the best decoding time, are not synchronizable and do not allow direct search in a compressed file. Also, the results show that the decoding time does not depend on the language of a text or on the ratio of different words to total words, but only on the size of a text and the code type.

Table 3
Average decoding time, milliseconds

Code\Text   Ukrainian Bible   Ukrainian         English Bible,    English formatted
                              fiction corpus    King James ver.   Wikipedia texts
Fib3        11.7              27.8              10.87             281
SCDC        4.56              12.4              4.62              139
RPBC        3.46              11.1              3.47              101
RMD         6.06              15.1              6.25              173

5. Conclusions

The reverse multi-delimiter data compression codes are defined, and their application to English and Ukrainian natural language text compression is investigated. These codes are universal, complete, and synchronizable, and they allow direct search in a compressed file. Although RMD-codes compress Ukrainian texts less efficiently than English ones, they provide better compression ratios for both languages than other codes with similar properties. In general, compression of Ukrainian texts appears to be more challenging because of the flatter (i.e., more uniform) distribution of word frequencies in Ukrainian. However, since the reverse MD-codes are multiparameterized, they are agile and can be aligned either to English or to Ukrainian texts more tightly than the other codes in research.
Considering both compression ratio and decoding time, the reverse multi-delimiter codes can be regarded as an attractive alternative to byte-aligned and Fibonacci codes in the compression of texts in different languages. The construction of codes that compress Ukrainian texts on a level with English ones while retaining all the above-mentioned properties remains an open challenge.

6. References

[1] A. Anisimov, I. Zavadskyi, Variable-length prefix codes with multiple delimiters, IEEE Transactions on Information Theory 63.5 (2017): 2885–2895.
[2] N. Brisaboa, A. Farina, G. Navarro, M. Esteller, (s,c)-dense coding: an optimized compression code for natural language text databases, in: Proc. Symposium on String Processing and Information Retrieval, SPIRE'03, 2003, pp. 122–136.
[3] J. Culpepper, A. Moffat, Enhanced byte codes with restricted prefix properties, in: String Processing and Information Retrieval, SPIRE'05, 2005, pp. 1–12.
[4] J. Duda, K. Tahboub, N. Gadgil, E. Delp, The use of asymmetric numeral systems as an accurate replacement for Huffman coding, in: Proceedings of the Picture Coding Symposium, 2015, pp. 65–69.
[5] D. Huffman, A method for the construction of minimum-redundancy codes, in: Proc. IRE 40, 1952, pp. 1098–1101.
[6] S.T. Klein, M.K. Ben-Nissan, On the usefulness of Fibonacci compression codes, Computer Journal 53.6 (2010): 701–716.
[7] I. Zavadskyi, A. Anisimov, A family of data compression codes with multiple delimiters, in: Proceedings of the Prague Stringology Conference, 2016, pp. 71–84.
[8] P. Elias, Universal codeword sets and representations of the integers, IEEE Transactions on Information Theory 21 (1975): 194–203.