<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Cantone</string-name>
          <email>domenico.cantone@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Faro</string-name>
          <email>simone.faro@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ICTCS'24: Italian Conference on Theoretical Computer Science</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Catania</institution>
          ,
          <addr-line>Viale A. Doria n.6, 95125, Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions existing in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. It uses m + O(n) bits, where m is the length of the encoded string. For a text of length n over an alphabet of size σ and a fixed parameter ℓ, the access time of our proposed encoding is proportional to the length of the character's code-word, plus an expected O((F_{σ−ℓ+3} − 3)/F_{σ+1}) overhead, where F_j is the j-th Fibonacci number.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The problem of text compression involves modifying the representation of any given text in
plain format (referred to as plain text) so that the output format requires less space (as little as
possible) for its storage and, consequently, less time for its transmission. It is understood that
compression schemes must guarantee that the plain text can be reconstructed exactly from its
compressed representation. Compressed representations also allow one to speed up algorithmic
computations, as they can make better use of the memory hierarchy available in modern PCs,
reducing disk access time. In addition, they find
applications in the fields of compressed data
structures [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] that permit the manipulation of data directly in their encoded form, heavily
used in bioinformatics and computer science. Thus, we contend that compression schemes with
direct accessibility to the elements of the encoded sequence are of fundamental importance.
      </p>
      <p>
        We tacitly assume that, in an uncompressed text T of length n over an alphabet Σ of size σ, every symbol is represented by ⌈log σ⌉ bits, for a total of n⌈log σ⌉ bits.¹ On the other hand, using a more efficient variable-length encoding, such as Huffman's optimal compression scheme [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], each character c ∈ Σ can be represented with a code φ(c), whose (variable) length
⋆An extended version of this work is available in a public repository [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The extended version encompasses some initial experimental findings indicating that the performance of our approach is, to some extent, comparable to that of DACs and Wavelet Trees, which are considered among the most efficient schemes.
      </p>
      <p>Table 1: overall space and access time to T[i] of the main direct-access schemes: SS, Dense Sampling (DS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Interpolative Coding (IC), Wavelet Trees (WT), DACs, and our SFDC, which uses m + O(n) bits and accesses T[i] in O(|φ(T[i])|) expected time.</p>
      <p>depends on the absolute frequency f(c) of c in the text T, allowing the text to be represented with m = ∑_{c∈Σ} f(c) · |φ(c)| ⩽ n⌈log σ⌉ bits.</p>
      <p>
        Specifically, the Huffman algorithm computes an optimal prefix code relative to given frequencies of the alphabet characters, so that decoding becomes particularly simple, as it can take advantage of ordered binary trees whose leaves are labeled with the alphabet characters and whose edges are labeled with 0 (left edges) or 1 (right edges), in such a way that the code-word of an alphabet character is the word labeling the branch from the root to the leaf labeled by the same character. Prefix code trees, as computed by the Huffman algorithm, are called Huffman trees. These are not unique, by any means. The usually preferred tree for a given set of frequencies, out of the various possible Huffman trees, is the one induced by canonical Huffman codes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such trees have the property that the sequence of leaf depths, scanned from left to right, is non-decreasing.
      </p>
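The canonical assignment just described can be sketched in a few lines (a minimal illustration of ours, not the authors' implementation): given leaf depths sorted in non-decreasing order, each code-word is the previous one incremented by 1 and then left-shifted to the new depth.

```python
# Minimal sketch of canonical prefix-code assignment (illustrative only):
# given code lengths sorted in non-decreasing order, each code-word is the
# previous one incremented by 1 and padded (shifted left) to the new length.
def canonical_codes(lengths):
    codes, code, prev_len = [], 0, lengths[0]
    for length in lengths:
        code = code * (2 ** (length - prev_len))  # pad to the new depth
        codes.append(format(code, "0%db" % length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes([2, 3, 3, 3, 5, 5, 7, 7, 10]))
```

The resulting code is prefix-free by construction; note that the concrete code-words may differ from those of Fig. 2, which follow the paper's own tie-breaking.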
      <p>One of the biggest problems with variable-length codes, like Huffman codes, is the impossibility of directly accessing the i-th code-word in the encoded string, for any 0 ⩽ i &lt; n, since its position in the encoded text depends on the sum of the lengths of the encodings of the characters that precede it.</p>
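To make this cost concrete, the following sketch (our own illustration, using a small hypothetical prefix-free code, not the codes of Fig. 2) shows that accessing the i-th symbol of a plain concatenated stream forces a scan over all preceding code-words:

```python
# With a plain concatenation of variable-length code-words, accessing the i-th
# symbol requires decoding every code-word that precedes it.
codes = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}  # hypothetical, prefix-free
decode_map = {v: k for k, v in codes.items()}

def access(bits, i):
    """Return the i-th symbol by greedy prefix matching from the start:
    cost is proportional to the total length of code-words 0..i."""
    pos = 0
    word = ''
    for _ in range(i + 1):
        start = pos
        while bits[start:pos + 1] not in decode_map:
            pos += 1                      # grow the slice until a code-word matches
        word = bits[start:pos + 1]
        pos += 1
    return decode_map[word]

stream = ''.join(codes[c] for c in "dcba")
print(access(stream, 0), access(stream, 3))
```

Greedy matching is safe here because the code is prefix-free; the point is that `access(stream, i)` must still walk over all earlier code-words before reaching position i.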
      <p>
        This is why, over the years, various encoding schemes have been proposed that are able to
complement variable-length codes with direct access to the characters of the encoded sequence.
Dense Sampling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Elias-Fano codes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Interpolative coding [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], Wavelet Trees [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and
DACs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] are just some of the best known and most eficient solutions to the direct access
problem. However, the possibility of directly accessing the encoding of any character of the text
comes at a cost, in terms of additional space used for the encoding, ranging from ( log  )
for Dense Sampling to ( ) for the case of Wavelet Trees. Table 1 summarises the main direct
access variable length codes, indicating the number of bits required for the encoding and the
access time to any character of the text.
      </p>
      <p>In this paper, we present a new variable-length encoding format for integer sequences that supports direct and fast access to any element of the compressed sequence. The proposed format, named SFDC (Succinct Format with Direct aCcessibility), is based on variable-length codes obtained from existing compression methods. Just for presentation purposes, in this paper we show how to construct our SFDC in the case of Huffman codes. Despite its apparent simplicity, which can be regarded as a value in itself, many interesting (and surprising) qualities emerge from our proposed encoding scheme. To the extent that we show in this paper, the SFDC encoding is relevant for the following reasons:
• it allows direct access to text characters in (expected) constant time, through a conceptually simple and easy to implement model;
• it achieves compression ratios that, under suitable conditions, are superior to those offered by other solutions in the literature;
• it offers a flexible representation that can be adapted to the type of application at hand, where one can prefer, according to her needs, aspects relating to efficiency versus those relating to space consumption;
• it is designed to naturally allow parallel accessing of multiple data, parallel computation, and adaptive access on textual data, allowing its direct application to various text processing tasks with the capability of improving performance up to a factor equal to the word length.</p>
      <p>Unlike other direct-access coding schemes, SFDC is based on a model that offers a constant access time in the expected case, when character frequencies have a low variability along the text. We can prove that, overall, our compression scheme uses a number of bits equal to m + O(n · (ℓ − (F_{σ+3} − 3)/F_{σ+1})) = m + O(n), where F_j is the j-th Fibonacci number and m is the length of the encoded string. In addition, our encoding allows accessing each character of the text in time proportional to the length of its encoding, plus an expected O((F_{σ−ℓ+3} − 3)/F_{σ+1}) time overhead.²</p>
      <p>2. A New Succinct Format with Direct Accessibility</p>
      <p>In this section, we present in detail our proposed SFDC representation and the related encoding and decoding procedures. Again, let T be a text of length n over an alphabet Σ of size σ, and let f : Σ → {0, . . . , n} be its corresponding frequency function. By running the Huffman algorithm on T, we generate a set of optimal prefix binary codes for the characters of Σ, represented by the map φ : Σ → {0, 1}⁺. For any c ∈ Σ, we denote by |φ(c)| the length of the binary code φ(c). Without loss of generality, we assume that the alphabet Σ = {0, 1, . . . , σ − 1} is arranged as an ordered set in non-decreasing order of frequencies, so that f(i) ⩽ f(i + 1) holds for 0 ⩽ i &lt; σ − 1. Thus, 0 and σ − 1 are the least and the most frequent characters in T, respectively. For instance, based on the codes shown in Fig. 2, the string Compression, of length 11, is encoded by the binary sequence</p>
      <p>1100011010· 1100111· 101· 0110101· 11101· 01· 001· 001· 11010· 1100111· 010 ,
where the individual character codes have been separated by grey dots to enhance readability.</p>
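This concatenation can be checked mechanically (a quick sanity check of ours, using the code table of Fig. 2 as transcribed above):

```python
# Sanity check: concatenating the Fig. 2 codes over "Compression" reproduces
# the 55-bit sequence shown above (grey dots removed).
codes = {'C': '1100011010', 'o': '1100111', 'm': '101', 'p': '0110101',
         'r': '11101', 'e': '01', 's': '001', 'i': '11010', 'n': '010'}
encoded = ''.join(codes[c] for c in "Compression")
print(encoded, len(encoded))
```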
      <p>Let max(φ) := max{|φ(c)| : c ∈ Σ} be the length of the longest code of any character in Σ, and assume that ℓ is any fixed integer constant such that 2 ⩽ ℓ ⩽ max(φ).
²From an experimental point of view, it is shown in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that in practical cases our scheme is particularly efficient and, in some respects, is comparable with the performance of DACs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] and Wavelet Trees [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which are among the most efficient schemes in the literature.</p>
      <p>The SFDC scheme encodes any string T of length n as an ordered collection of ℓ binary strings representing ℓ − 1 fixed layers and an additional dynamic layer. The number ℓ of layers is particularly relevant for the encoding performance. We will refer to it as the size of the SFDC representation.</p>
      <p>The first ℓ − 1 binary strings, denoted T̂_0, T̂_1, . . . , T̂_{ℓ−2}, have length n. Specifically, the j-th binary string T̂_j is the sequence of the j-th bits (if present, 0 otherwise) of the encodings of the characters in T, in the order in which they appear in T. More formally, for 0 ⩽ j ⩽ ℓ − 2, we have T̂_j := ⟨φ(T[0])[j], φ(T[1])[j], . . . , φ(T[n − 1])[j]⟩, where we agree that φ(T[i])[j] = 0 when j ⩾ |φ(T[i])|, for 0 ⩽ i &lt; n. Thus, each binary string T̂_j can be regarded as an array of ⌈n/w⌉ binary blocks of size w, where w is the machine word size. We refer to the bit vectors T̂_0, T̂_1, . . . , T̂_{ℓ−2} as the fixed layers of the encoding, and to the bits stored in them as the fixed bits. The last layer of the encoding is a binary string D̂ := T̂_{ℓ−1}, of length d ⩾ n, which gathers all the bits at positions ℓ − 1, ℓ, . . . of the encodings whose length exceeds ℓ − 1. We refer to such an additional layer as the dynamic layer, and to the bits in it as the pending bits.</p>
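The construction of the fixed layers can be sketched as follows (an illustration of ours; layers are kept as plain strings, and code-words shorter than ℓ − 1 bits are padded with 0s, as stipulated above):

```python
# Build the l-1 fixed layers of the SFDC: layer j collects bit j of every
# code-word of the text, using 0 for code-words shorter than j+1 bits.
codes = {'C': '1100011010', 'o': '1100111', 'm': '101', 'p': '0110101',
         'r': '11101', 'e': '01', 's': '001', 'i': '11010', 'n': '010'}

def fixed_layers(text, codes, l):
    padded = [codes[c].ljust(l - 1, '0') for c in text]   # pad short codes with 0
    return [''.join(p[j] for p in padded) for j in range(l - 1)]

layers = fixed_layers("Compression", codes, 6)
for layer in layers:
    print(layer)
```

Each layer has exactly one bit per text position, so bit i of every layer can be reached directly, without knowing the lengths of the preceding code-words.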
      <p>Pending bits are arranged within the dynamic layer proceeding from left to right and following a last-in first-out (LIFO) scheme. Specifically, such a scheme obeys the following three rules: (a) If the encoding φ(T[i]) has more than one pending bit, then each bit φ(T[i])[j + 1] is stored on the dynamic layer D̂ at some position to the right of that at which the bit φ(T[i])[j] is stored, for j = ℓ − 1, ℓ, . . . , |φ(T[i])| − 2. (b) For 0 ⩽ i &lt; j &lt; n, each pending bit of φ(T[j]) (if any) is stored in the dynamic layer either at a position p such that i ⩽ p &lt; j or to the right of all positions at which the pending bits (if any) of φ(T[i]) have been stored. (c) Every pending bit is stored in the leftmost available position in the dynamic layer, complying with rules (a) and (b).</p>
      <p>Fig. 2 shows the layout of the pending bits, arranged according to the LIFO scheme just illustrated. The fixed layers have a white background, while the dynamic layer has a grey one. In the middle, a relaxed 2-dimensional representation of the pending bits (i.e., the bits past position ℓ − 2) in the dynamic layer is shown. On the right, the pending bits are arranged linearly along the dynamic layer, according to a last-in first-out approach. Idle bits are indicated with the symbol -. In the following sections, we will present the encoding and decoding procedures of the SFDC compression scheme.</p>
      <p>Fig. 2 (left): the character codes used in the running example, with their lengths: s ↦ 001 (3), e ↦ 01 (2), n ↦ 010 (3), p ↦ 0110101 (7), m ↦ 101 (3), C ↦ 1100011010 (10), o ↦ 1100111 (7), i ↦ 11010 (5), r ↦ 11101 (5); (right): the fixed layers T̂_0, T̂_1, T̂_2, T̂_3, T̂_4 and the dynamic layer of the SFDC representation of the string Compression.</p>
      <sec id="sec-2-1">
        <title>2.1. The Encoding Procedure</title>
        <p>The encoding procedure (see Figure 3, on the left) takes as input a text T of length n, the mapping function φ generated by the Huffman compression algorithm, and the size ℓ of the SFDC representation. We recall that the mapping function φ associates each character c ∈ Σ with a variable-length binary code, and that the set of codes {φ(c) | c ∈ Σ} is prefix-free.</p>
        <p>While iterating over all n characters of the text, the encoding procedure uses a stack S to implement the LIFO strategy for the arrangement of pending bits within the dynamic layer D̂. At the i-th iteration, the algorithm takes care of inserting the first bits of the encoding φ(T[i]) in the fixed layers, up to a maximum of ℓ − 1 bits. If |φ(T[i])| &lt; ℓ − 1, exactly ℓ − |φ(T[i])| − 1 bits in the fixed layers will remain idle. Conversely, when the length of the encoding of the character T[i] exceeds the value ℓ − 1, the remaining (pending) bits are pushed in reverse order into the stack S. At the end of the i-th iteration, the first bit extracted from the stack is stored in the dynamic layer. Should the stack be empty at the i-th iteration, the i-th bit of the dynamic layer will remain idle. At the end of the procedure, all the pending bits still in the stack, if any, are placed at the end of the dynamic layer.</p>
        <p>Assuming that each of the layers of length n used in the representation can be initialised at 0 in constant time (or at most in ⌈n/w⌉ steps), the execution time of the encoding algorithm for a text T of length n is trivially equal to the sum of the encoding lengths of each individual character. Thus the encoding of a text of length n is performed in O(m) time, where m is the length of the encoded text. No additional time is required by the algorithm.</p>
        <p>Concerning the number of bits used by the SFDC representation, it can be proved that our compression scheme uses, overall, a number of bits equal to m + O(n · (ℓ − (F_{σ+3} − 3)/F_{σ+1})) = m + O(n), where F_j is the j-th Fibonacci number.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Decoding Procedure</title>
        <p>Although character decoding from a text with an SFDC representation takes place via direct access to the position where the encoding begins, it does not necessarily take constant time. In fact, during the decoding of a character T[i], it may be necessary to address some additional work related to a certain number of additional characters that have to be decoded in order to complete the full decoding of T[i]. In particular, if the last bit of the encoding of T[i] is placed at position j in the dynamic layer, where j ⩾ i, then the procedure is forced to decode all the characters from position i to position j of the text, T[j] included. We refer to such additional work, namely the j − i additional characters that must be decoded in order to complete the decoding of character T[i], as decoding delay. When no character other than T[i] needs to be decoded, we have j = i and the decoding delay is 0.</p>
        <p>Despite the presence of decoding delays, when the whole text, or just a window of it, needs to be decoded, the additional work is not lost: it may indeed happen that a character at position j is decoded before a character at position i, with i &lt; j. This happens when the pending bits of φ(T[i]) are stored after position j.</p>
        <p>We assume that the decoding procedure is invoked to decode the window T[b .. e] of the text, with b ⩽ e. The procedure (see Figure 3, on the right) takes as input the SFDC representation of the text T of length n, the starting position b and the ending position e of the window to be decoded, the root of the Huffman tree, and the size ℓ of the SFDC representation. Note that in order to decode the single character at position i, the procedure needs to be invoked with b = e = i. Likewise, to decode the entire text, the procedure must be invoked with b = 0 and e = n − 1.</p>
        <p>Decoding of a character T[i] is done by scanning a descending path in the Huffman tree, from the root to a leaf. Specifically, while scanning the bits of the encoding φ(T[i]), we descend into the left or right subtree, according to whether we find a bit equal to 0 or to 1, respectively. We assume that each node v in the tree has a Boolean value associated with it, v.leaf, which is set to True when the node v is a leaf, False otherwise. We further assume that v.left (resp., v.right) allows one to access the left (resp., right) child of v. If the node v is a leaf, then it contains a parameter, v.symbol, that allows one to access the character associated with the encoding.</p>
        <p>Again, we maintain a stack to extract the pending bits in the dynamic layer, storing the nodes in the Huffman tree associated with the characters in the text for which decoding has started but is not yet complete. Within the stack, the node v used for decoding the character T[i] is coupled with the position i itself, so that the symbol can be positioned in constant time, once decoded. Thus, the elements of the stack are of the form ⟨v, i⟩. In our description we denote by v_i the node within the Huffman tree used for decoding the character T[i]. Since we use a LIFO strategy for bit arrangements in the dynamic layer, when two nodes v_i and v_j, with 1 ⩽ i &lt; j ⩽ n, are contained in the stack, the node v_j is closer than v_i to the top of the stack.</p>
        <p>Each iteration of the procedure explores vertically the i-th bit of each layer, proceeding from the first layer T̂_0 towards the dynamic layer D̂, in an attempt to decode the i-th character of the text, for i ⩾ b. The iterations stop when the entire window T[b .. e] has been fully decoded. This condition occurs when the character T[e] has been decoded (i.e., W[e] ≠ Nil, where W is the output buffer) and the stack is empty, so that all characters preceding position e in the window have already been decoded.</p>
        <p>Each iteration is divided into two parts: the first part, which is executed only when i &lt; n, explores the Huffman tree, starting from the root and proceeding towards the leaves, in the hope of directly decoding the i-th character (this happens when |φ(T[i])| &lt; ℓ). If a leaf is reached at this stage, the decoded symbol is transcribed; otherwise, the node v is kept on hold and pushed onto the stack. The second part of the loop, if necessary (i.e., when S ≠ ∅), scans the current bit of the dynamic layer D̂ and updates the node v at the top of the stack accordingly. If the node v reaches the position of a leaf, the corresponding symbol is transcribed to position W[j], and the node is deleted from the stack.</p>
        <p>Figure 3: the encoding procedure (left) and the decoding procedure (right) of the SFDC scheme:
readbit(B, i):
  return (B[⌊i/w⌋] ≫ (i mod w)) &amp; 1
writebit(B, i, b):
  B[⌊i/w⌋] ← B[⌊i/w⌋] | (b ≪ (i mod w))
encode(T, n, φ, ℓ):
  S ← new Stack
  i ← 0
  while i &lt; n do
    h ← 0
    while h &lt; min(|φ(T[i])|, ℓ − 1) do
      writebit(T̂_h, i, φ(T[i])[h])
      h ← h + 1
    for j ← |φ(T[i])| − 1 downto h do
      S.push(φ(T[i])[j])
    if S ≠ ∅ then
      writebit(D̂, i, S.pop())
    i ← i + 1
  while S ≠ ∅ do
    writebit(D̂, i, S.pop())
    i ← i + 1
  return ⟨T̂_0, . . . , T̂_{ℓ−2}, D̂⟩
decode(T̂, D̂, n, b, e, root, ℓ):
  S ← new Stack
  W[e] ← Nil
  i ← b
  while W[e] = Nil or S ≠ ∅ do
    if i &lt; n then
      v ← root
      h ← 0
      while h &lt; ℓ − 1 and not v.leaf do
        if readbit(T̂_h, i) then v ← v.right
        else v ← v.left
        h ← h + 1
      if v.leaf then W[i] ← v.symbol
      else S.push(⟨v, i⟩)
    if S ≠ ∅ then
      ⟨v, j⟩ ← S.pop()
      if readbit(D̂, i) then v ← v.right
      else v ← v.left
      if v.leaf then W[j] ← v.symbol
      else S.push(⟨v, j⟩)
    i ← i + 1
  return W[b..e]</p>
        <p>The time complexity required to decode a single character T[i] is O(|φ(T[i])| + δ), while decoding a text window T[b..e] requires O(∑_{i=b}^{e} |φ(T[i])| + δ), where δ is the decoding delay.</p>
        <p>The decoding delay is a key feature for the use of SFDC in practical applications, since the expected access time to text characters is connected to it. This value depends on the distribution of the characters within the text and on the size ℓ of the representation. Ideally, the SFDC representation should use the minimum number of layers that ensures that the expected value of the decoding delay falls below a certain user-defined bound.</p>
        <p>Since in general neither the frequency of the characters within the text nor their distribution is known in advance, it is not possible to define a priori the number of layers that is most suitable for the application at hand. However, one can compute the expected decoding delay by simulating the construction of the SFDC representation of the text for a given number of layers. Should this value be above the bound, one can repeat the computation with a higher number of layers, until a value of ℓ is reached that guarantees an acceptable delay. Such a computation can be done by means of a procedure (see Figure 4) that simulates the construction of the SFDC representation of the text T, without actually producing it.</p>
        <p>Again, assuming that the ℓ layers of the SFDC can be initialised at 0 in constant time, the complexity of such a procedure is O(m), where we recall that m = ∑_{i=0}^{n−1} |φ(T[i])|.</p>
        <p>Concerning complexity issues, it can be shown that, in the worst case, the summation ∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ), where f̄(c) denotes the relative frequency of each character c ∈ Σ, provides an estimate of the expected delay of our SFDC encoding with ℓ layers (where 4 ⩽ ℓ ⩽ σ − 1). Additionally, it can be proved that ∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ) = (F_{σ−ℓ+3} − 3)/F_{σ+1}, where F_j is the j-th Fibonacci number. As a consequence, the SFDC encoding allows accessing each character of the text in time proportional to the length of its encoding, plus an expected O((F_{σ−ℓ+3} − 3)/F_{σ+1}) time overhead, which is less than 1.0 when ℓ ⩾ 4.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Complexity Issues</title>
      <p>In this section, we present a theoretical analysis of the performance of our SFDC encoding, in
terms of expected values of decoding delay and average number of idle bits. The first value is
related to the average access time, while the second one gives us an estimate of the additional
space used by the encoding with respect to the minimum number of bits required to compress
the sequence. In our analysis, we make some simplifying assumptions that do not significantly
impact the overall results. When necessary, in order to support the reasonableness of our
arguments, we will compare theoretical results with experimentally obtained data, reproducing
the same conditions as in the theoretical analysis.</p>
      <p>Before entering into the details of our analysis, we state two useful identities related to Fibonacci numbers, which hold for every k ⩾ 0 and can be proved by induction on k:
∑_{i=0}^{k} F_i = F_{k+2} − 1,   (1)
∑_{i=0}^{k} i · F_i = k · F_{k+2} − F_{k+3} + 2.   (2)
For convenience, we define a slight modification of the Fibonacci sequence, by putting F̄_i := 1 if i = 0, and F̄_i := F_i otherwise.</p>
      <p>Hence, by (1), we have ∑_{i=0}^{k} F̄_i = F_{k+2}, for every k ∈ N.</p>
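These identities are easy to check numerically (a small verification script of ours):

```python
# Numerical check of the Fibonacci identities (1) and (2), and of the sum of
# the modified sequence (F-bar_0 = 1, F-bar_i = F_i otherwise).
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def fbar(i):
    return 1 if i == 0 else fib(i)

for k in range(20):
    assert sum(fib(i) for i in range(k + 1)) == fib(k + 2) - 1                       # (1)
    assert sum(i * fib(i) for i in range(k + 1)) == k * fib(k + 2) - fib(k + 3) + 2  # (2)
    assert sum(fbar(i) for i in range(k + 1)) == fib(k + 2)                          # modified sum
print("identities verified")
```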
      <p>Let Σ := {0, 1, . . . , σ − 1} be an alphabet of size σ, and assume that the characters in Σ are ordered in non-decreasing order with respect to their frequency in a given text T (of length n), namely, f(i) ⩽ f(i + 1) holds for 0 ⩽ i &lt; σ − 1, where f : Σ → N⁺ is the frequency function.</p>
      <p>In order to keep the number of bits used by our encoding scheme as low as possible, it is useful to choose the number ℓ of layers as close as possible to the expected value μ := (1/n) · ∑_{i=0}^{n−1} |φ(T[i])| of the length of the Huffman encodings of the characters in the text T. In particular, if we set ℓ = ⌈μ⌉, it is guaranteed that the number of idle bits, equal to (nℓ − m), is the minimum possible. As already observed, should the decoding delay prove to be too high in practical applications, the value of ℓ can be increased.</p>
      <p>In terms of decoding delay, the optimal case for the application of the SFDC method is when all the characters of the alphabet have the same frequency. In this circumstance, in fact, the character Huffman encodings have length between ⌊log σ⌋ and ⌈log σ⌉.³ Hence, in this case it suffices to use a number of layers equal to ℓ = ⌈log σ⌉ to obtain guaranteed direct access to each character of the text, hence a decoding delay equal to 0.</p>
      <p>On the other side of the spectrum, the worst case occurs when the Huffman tree, in its canonical configuration, is completely unbalanced. In this case, in fact, the difference between the expected value μ and the maximum length of the Huffman encodings, namely (max_{c∈Σ} |φ(c)|) − μ, is as large as possible, thus causing the maximum possible increase of the average decoding delay. In addition, we observe that the presence of several characters in the text whose encoding length is less than ℓ does not positively affect the expected value of the delay, since pending bits are placed exclusively in the dynamic layer. Rather, the presence of such short-code characters increases the number of idle bits of the encoding.
³More precisely, there are 2σ − 2^{⌊log σ⌋+1} characters with encodings of length ⌈log σ⌉ and 2^{⌊log σ⌋+1} − σ characters with encodings of length ⌊log σ⌋.</p>
      <p>We call such a completely unbalanced tree a degenerate tree. Among all the possible texts whose character frequencies induce a degenerate Huffman coding tree, the ones that represent the worst case for our encoding scheme are those in which the frequency differences f(i) − f(i − 1), for i = 1, 2, . . . , σ − 1, are minimal. It is not hard to verify that such a condition occurs when the frequencies follow a Fibonacci-like distribution of the following form:
f(i) := F̄_i, for i = 0, 1, . . . , σ − 1.   (3)</p>
      <p>Thus, let us assume T to be a random text over Σ with the frequency function (3), and let f̄ : Σ → [0, 1] be the relative frequency function of the text T, where f̄(i) = f(i)/F_{σ+1}, for i = 0, 1, . . . , σ − 1. Assume also that φ : Σ → {0, 1}⁺ is the canonical degenerate Huffman encoding of Σ relative to the frequency function (3), with |φ(i)| = σ − 1 if i = 0, and |φ(i)| = σ − i if 1 ⩽ i ⩽ σ − 1.</p>
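That the Fibonacci-like frequencies (3) do force this degenerate shape can be checked by comparing the optimal (Huffman) cost with the cost of the degenerate code lengths above, since all optimal prefix codes for the same frequencies have the same total cost (a small check of ours):

```python
import heapq

# Under the Fibonacci-like frequencies (3), the optimal Huffman cost equals the
# cost of the degenerate code with |phi(0)| = sigma-1 and |phi(i)| = sigma-i,
# i.e. the Huffman tree is completely unbalanced.
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def huffman_cost(freqs):
    # total weighted code length = sum of all internal-node weights
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        x, y = heapq.heappop(heap), heapq.heappop(heap)
        cost += x + y
        heapq.heappush(heap, x + y)
    return cost

for sigma in range(3, 15):
    freqs = [1] + [fib(i) for i in range(1, sigma)]   # F-bar_0, ..., F-bar_{sigma-1}
    degenerate = (sigma - 1) + sum(fib(i) * (sigma - i) for i in range(1, sigma))
    assert huffman_cost(freqs) == degenerate
print("degenerate cost verified")
```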
      <p>Lemma 1. The expected encoding length E[|φ(T[i])|] by φ of the characters in T is equal to (F_{σ+3} − 3)/F_{σ+1}.</p>
      <p>Proof. Using (1) and (2), we have:
E[|φ(T[i])|] = ∑_{i=0}^{σ−1} f̄(i) · |φ(i)| = (1/F_{σ+1}) · (∑_{i=1}^{σ−1} F_i · (σ − i) + σ − 1) = (1/F_{σ+1}) · (F_{σ+3} − σ − 2 + σ − 1) = (F_{σ+3} − 3)/F_{σ+1}.</p>
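The statement of Lemma 1 can be verified exactly with rational arithmetic (our own check, using the degenerate code lengths and relative frequencies defined above):

```python
from fractions import Fraction

# Check of Lemma 1: under the relative Fibonacci frequencies, the expected
# code length equals (F_{sigma+3} - 3) / F_{sigma+1}, exactly.
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def expected_length(sigma):
    fbar = [1] + [fib(i) for i in range(1, sigma)]
    length = [sigma - 1] + [sigma - i for i in range(1, sigma)]
    total = fib(sigma + 1)  # sum of the modified Fibonacci frequencies
    return sum(Fraction(f, total) * L for f, L in zip(fbar, length))

for sigma in range(3, 20):
    assert expected_length(sigma) == Fraction(fib(sigma + 3) - 3, fib(sigma + 1))
print(float(expected_length(30)))
```

For large σ the value approaches the square of the golden ratio, ≈ 2.618, as discussed below.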
      <p>As a consequence of the above, the expected number of idle bits per character in an SFDC encoding using ℓ layers can be estimated by the expression ℓ − (F_{σ+3} − 3)/F_{σ+1}. We evaluated this estimate on a text over an alphabet with Fibonacci frequencies, comparing theoretical values against values experimentally computed. To obtain the experimental values, we used artificially generated texts of 100 MB whose character frequencies follow the Fibonacci distribution. Values are shown for 5 ⩽ ℓ ⩽ 8 and σ ∈ {10, 20, 30}.</p>
      <p>As the size σ of the alphabet increases, it is easy to verify that this expression quickly converges to the value ℓ − 2.618. Indeed, recalling that lim_{i→∞} F_{i+1}/F_i = ϕ, where ϕ = (1 + √5)/2 is the golden ratio, by Lemma 1 we have lim_{σ→∞} E[|φ(T[i])|] = lim_{σ→∞} (F_{σ+3} − 3)/F_{σ+1} = ϕ² ≈ 2.618.</p>
      <p>Consequently, if we assume that the value ℓ represents a constant implementation-related parameter, the total space used by SFDC for encoding a sequence of n characters is equal to m + O(n), where m := ∑_{c=0}^{σ−1} f(c) · |φ(c)|.</p>
      <sec id="sec-3-1">
        <title>3.1. Expected Decoding Delay</title>
        <p>On the basis of what was shown above, we now estimate the expected value of the decoding delay, i.e., the number of additional characters that need to be decoded in order to obtain the full encoding of a character of the text. We claim that it can be estimated by the summation ∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ).</p>
        <p>Assume that the i-th character of the text is T[i] = c, where 0 ⩽ i &lt; n and 0 ⩽ c &lt; σ. We want to estimate the position j ⩾ i at which the last pending bit of the encoding φ(T[i]) is placed, so that the delay of character T[i] is j − i.</p>
        <p>Let us assume that the number of layers of the representation is equal to ℓ. If the length of the encoding of T[i] falls within the number ℓ of layers, namely |φ(T[i])| ⩽ ℓ, then the character can be decoded in a single cycle and the delay is 0. On the other hand, if |φ(T[i])| &gt; ℓ, then |φ(T[i])| − ℓ bits remain to be read, all arranged within the dynamic layer. Since the length of the encoding of any text character can be approximated in the average case by the value 2.618, if we assume to use a fixed number of layers ℓ ⩾ 4, we expect that on average the pending bits of φ(T[i]) will be placed in the |φ(T[i])| − ℓ positions past position i. The average delay in this case can be estimated by |φ(T[i])| − ℓ.</p>
        <p>Thus, for an alphabet of σ characters with frequency function (3), the expected delay of our Fibonacci encoding with ℓ layers (where 4 ⩽ ℓ ⩽ σ − 1) can be estimated by the summation
∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ) = (1/F_{σ+1}) · (∑_{c=1}^{σ−ℓ−1} F_c · (σ − c − ℓ) + σ − ℓ − 1) = (F_{σ−ℓ+3} − 3)/F_{σ+1}.   (4)
We compared the average delay in decoding a single character in a text over an alphabet with Fibonacci frequencies using SFDC, tabulating the theoretical values resulting from the identity (4) against experimentally computed values, for 5 ⩽ ℓ ⩽ 8 and σ ∈ {10, 15, 20, 25, 30}. To obtain the experimental values, we generated texts of 100 MB whose character frequency is the Fibonacci frequency (3). For each such text T, the average delay value was experimentally obtained by accessing all the characters in random order, computing the delays, and dividing the sum by the total number of characters in T.</p>
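The closed form in (4) can be checked exactly (our own verification; following the derivation, only the positive terms |φ(c)| − ℓ of the summation contribute):

```python
# Check of identity (4): under the Fibonacci frequencies, the (unnormalised)
# expected decoding delay equals F_{sigma-l+3} - 3; dividing by F_{sigma+1}
# gives the expected delay per character.
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def expected_delay_numerator(sigma, l):
    fbar = [1] + [fib(i) for i in range(1, sigma)]
    length = [sigma - 1] + [sigma - i for i in range(1, sigma)]
    return sum(f * max(L - l, 0) for f, L in zip(fbar, length))

for sigma in range(10, 31, 5):
    for l in range(4, 9):
        assert expected_delay_numerator(sigma, l) == fib(sigma - l + 3) - 3
print("identity (4) verified")
```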
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future works</title>
      <p>In this paper, we have introduced a data reorganisation technique that, when applied to variable-length codes, allows easy, direct, and fast access to any character of the encoded text, in time proportional to the length of its code-word, plus an additional overhead that is constant in many practical cases. Besides being extremely simple to translate into a computer program and efficient in terms of space and time, our method has the surprising feature of being also computation-friendly. We expect, in fact, that it will turn out to be particularly suitable for applications in text processing. Our future work will be devoted to demonstrating such a claim.</p>
    </sec>
  </body>
  <back>
  </back>
</article>