<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Cantone</string-name>
          <email>domenico.cantone@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Faro</string-name>
          <email>simone.faro@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ICTCS'24: Italian Conference on Theoretical Computer Science</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Catania</institution>
          ,
          <addr-line>Viale A. Doria n.6, 95125, Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions existing in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. It uses m + O(n) bits, where m is the length of the encoded string. For a text of length n over an alphabet of size σ and a fixed parameter ℓ, the access time of our proposed encoding is proportional to the length of the character's code-word, plus an expected O((F_{σ−ℓ+3} − 3)/F_{σ+1}) overhead, where F_j is the j-th Fibonacci number.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The problem of text compression involves modifying the representation of any given text in
plain format (referred to as plain text) so that the output format requires less space (as little as
possible) for its storage and, consequently, less time for its transmission. It is understood that
compression schemes must guarantee that the plain text can be reconstructed exactly from its
compressed representation. Compressed representations also allow one to speed up algorithmic
computations, as they can make better use of the memory hierarchy available in modern PCs,
reducing disk access time. In addition, they find
applications in the fields of compressed data
structures [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] that permit the manipulation of data directly in their encoded form, heavily
used in bioinformatics and computer science. Thus, we contend that compression schemes with
direct accessibility to the elements of the encoded sequence are of fundamental importance.
      </p>
      <p>
        We tacitly assume that, in an uncompressed text T of length n over an alphabet Σ of size σ, every symbol is represented by ⌈log σ⌉ bits, for a total of n⌈log σ⌉ bits.¹ On the other hand, using a more efficient variable-length encoding, such as Huffman's optimal compression scheme [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], each character c ∈ Σ can be represented with a code φ(c), whose (variable) length
⋆An extended version of this work is available in a public repository [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The extended version encompasses some initial experimental findings indicating that the performance of our approach is, to some extent, comparable to that of DACs and Wavelet Trees, which are considered among the most efficient schemes.
      </p>
      <p>Table 1: overall space and access time to T[i] of the main direct-access schemes: SS, Dense Sampling (DS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Interpolative Coding (IC), Wavelet Trees (WT), DACs, and our SFDC, which uses m + O(n) bits and accesses T[i] in O(|φ(T[i])|) expected time.</p>
      <p>depends on the absolute frequency f(c) of c in the text T, allowing the text to be represented with m = ∑_{c∈Σ} f(c) · |φ(c)| ⩽ n⌈log σ⌉ bits.</p>
      <p>
        Specifically, the Huffman algorithm computes an optimal prefix code relative to given frequencies of the alphabet characters, so that decoding becomes particularly simple, as it can take advantage of ordered binary trees whose leaves are labeled with the alphabet characters and whose edges are labeled with 0 (left edges) or 1 (right edges), in such a way that the code-word of an alphabet character is the word labeling the branch from the root to the leaf labeled by the same character. Prefix code trees, as computed by the Huffman algorithm, are called Huffman trees. These are not unique, by any means. The usually preferred tree for a given set of frequencies, out of the various possible Huffman trees, is the one induced by canonical Huffman codes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such trees have the property that the sequence of leaf depths, scanned from left to right, is non-decreasing.
      </p>
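The canonical assignment just described can be sketched in a few lines (a minimal illustration of ours, not the authors' implementation): given leaf depths sorted in non-decreasing order, each code-word is the previous one incremented by 1 and then left-shifted to the new depth.

```python
# Minimal sketch of canonical prefix-code assignment (illustrative only):
# given code lengths sorted in non-decreasing order, each code-word is the
# previous one incremented by 1 and padded (shifted left) to the new length.
def canonical_codes(lengths):
    codes, code, prev_len = [], 0, lengths[0]
    for length in lengths:
        code = code * (2 ** (length - prev_len))  # pad to the new depth
        codes.append(format(code, "0%db" % length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes([2, 3, 3, 3, 5, 5, 7, 7, 10]))
```

The resulting code is prefix-free by construction; note that the concrete code-words may differ from those of Fig. 2, which follow the paper's own tie-breaking.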
      <p>One of the biggest problems with variable-length codes, like Huffman codes, is the impossibility of directly accessing the i-th code-word in the encoded string, for any 0 ⩽ i &lt; n, since its position in the encoded text depends on the sum of the lengths of the encodings of the characters that precede it.</p>
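To make this cost concrete, the following sketch (our own illustration, using a small hypothetical prefix-free code, not the codes of Fig. 2) shows that accessing the i-th symbol of a plain concatenated stream forces a scan over all preceding code-words:

```python
# With a plain concatenation of variable-length code-words, accessing the i-th
# symbol requires decoding every code-word that precedes it.
codes = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}  # hypothetical, prefix-free
decode_map = {v: k for k, v in codes.items()}

def access(bits, i):
    """Return the i-th symbol by greedy prefix matching from the start:
    cost is proportional to the total length of code-words 0..i."""
    pos = 0
    word = ''
    for _ in range(i + 1):
        start = pos
        while bits[start:pos + 1] not in decode_map:
            pos += 1                      # grow the slice until a code-word matches
        word = bits[start:pos + 1]
        pos += 1
    return decode_map[word]

stream = ''.join(codes[c] for c in "dcba")
print(access(stream, 0), access(stream, 3))
```

Greedy matching is safe here because the code is prefix-free; the point is that `access(stream, i)` must still walk over all earlier code-words before reaching position i.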
      <p>
        This is why, over the years, various encoding schemes have been proposed that are able to
complement variable-length codes with direct access to the characters of the encoded sequence.
Dense Sampling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Elias-Fano codes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Interpolative coding [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], Wavelet Trees [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and
DACs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] are just some of the best known and most eficient solutions to the direct access
problem. However, the possibility of directly accessing the encoding of any character of the text
comes at a cost, in terms of additional space used for the encoding, ranging from ( log  )
for Dense Sampling to ( ) for the case of Wavelet Trees. Table 1 summarises the main direct
access variable length codes, indicating the number of bits required for the encoding and the
access time to any character of the text.
      </p>
      <p>In this paper, we present a new variable-length encoding format for integer sequences that supports direct and fast access to any element of the compressed sequence. The proposed format, named SFDC (Succinct Format with Direct aCcessibility), is based on variable-length codes obtained from existing compression methods. Just for presentation purposes, in this paper we show how to construct our SFDC in the case of Huffman codes. Despite its apparent simplicity, which can be regarded as a value in itself, many interesting (and surprising) qualities emerge from our proposed encoding scheme. To the extent that we show in this paper, the SFDC encoding is relevant for the following reasons:
• it allows direct access to text characters in (expected) constant time, through a conceptually simple and easy to implement model;
• it achieves compression ratios that, under suitable conditions, are superior to those offered by other solutions in the literature;
• it offers a flexible representation that can be adapted to the type of application at hand, where one can prefer, according to her needs, aspects relating to efficiency versus those relating to space consumption;
• it is designed to naturally allow parallel accessing of multiple data, parallel computation, and adaptive access on textual data, allowing its direct application to various text processing tasks with the capability of improving performance up to a factor equal to the word length.</p>
      <p>Unlike other direct-access coding schemes, SFDC is based on a model that offers a constant access time in the expected case, when character frequencies have a low variability along the text. We can prove that, overall, our compression scheme uses a number of bits equal to m + O(n · (ℓ − (F_{σ+3} − 3)/F_{σ+1})) = m + O(n), where F_j is the j-th Fibonacci number and m is the length of the encoded string. In addition, our encoding allows accessing each character of the text in time proportional to the length of its encoding, plus an expected O((F_{σ−ℓ+3} − 3)/F_{σ+1}) time overhead.²</p>
      <p>2. A New Succinct Format with Direct Accessibility</p>
      <p>In this section, we present in detail our proposed SFDC representation and the related encoding and decoding procedures. Again, let T be a text of length n over an alphabet Σ of size σ, and let f : Σ → {0, . . . , n} be its corresponding frequency function. By running the Huffman algorithm on T, we generate a set of optimal prefix binary codes for the characters of Σ, represented by the map φ : Σ → {0, 1}⁺. For any c ∈ Σ, we denote by |φ(c)| the length of the binary code φ(c). Without loss of generality, we assume that the alphabet Σ = {0, 1, . . . , σ − 1} is arranged as an ordered set in non-decreasing order of frequencies, so that f(i) ⩽ f(i + 1) holds for 0 ⩽ i &lt; σ − 1. Thus, 0 and σ − 1 are the least and the most frequent characters in T, respectively. For instance, based on the codes shown in Fig. 2, the string Compression, of length 11, is encoded by the binary sequence</p>
      <p>1100011010· 1100111· 101· 0110101· 11101· 01· 001· 001· 11010· 1100111· 010 ,
where the individual character codes have been separated by grey dots to enhance readability.</p>
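This concatenation can be checked mechanically (a quick sanity check of ours, using the code table of Fig. 2 as transcribed above):

```python
# Sanity check: concatenating the Fig. 2 codes over "Compression" reproduces
# the 55-bit sequence shown above (grey dots removed).
codes = {'C': '1100011010', 'o': '1100111', 'm': '101', 'p': '0110101',
         'r': '11101', 'e': '01', 's': '001', 'i': '11010', 'n': '010'}
encoded = ''.join(codes[c] for c in "Compression")
print(encoded, len(encoded))
```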
      <p>Let max(φ) := max{|φ(c)| : c ∈ Σ} be the length of the longest code of any character in Σ, and assume that ℓ is any fixed integer constant such that 2 ⩽ ℓ ⩽ max(φ).
²From an experimental point of view, it is shown in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that in practical cases our scheme is particularly efficient and, in some respects, is comparable with the performance of DACs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] and Wavelet Trees [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which are among the most efficient schemes in the literature.</p>
      <p>The SFDC scheme encodes any string T of length n as an ordered collection of ℓ binary strings representing ℓ − 1 fixed layers and an additional dynamic layer. The number ℓ of layers is particularly relevant for the encoding performance. We will refer to it as the size of the SFDC representation.</p>
      <p>The first ℓ − 1 binary strings, denoted T̂_0, T̂_1, . . . , T̂_{ℓ−2}, have length n. Specifically, the j-th binary string T̂_j is the sequence of the j-th bits (if present, 0 otherwise) of the encodings of the characters in T, in the order in which they appear in T. More formally, for 0 ⩽ j ⩽ ℓ − 2, we have T̂_j := ⟨φ(T[0])[j], φ(T[1])[j], . . . , φ(T[n − 1])[j]⟩, where we agree that φ(T[i])[j] = 0 when j ⩾ |φ(T[i])|, for 0 ⩽ i &lt; n. Thus, each binary string T̂_j can be regarded as an array of ⌈n/w⌉ binary blocks of size w, where w is the machine word size. We refer to the bit vectors T̂_0, T̂_1, . . . , T̂_{ℓ−2} as the fixed layers of the encoding, and to the bits stored in them as the fixed bits. The last layer of the encoding is a binary string D̂ := T̂_{ℓ−1}, of length d ⩾ n, which gathers all the bits at positions ℓ − 1, ℓ, . . . of the encodings whose length exceeds ℓ − 1. We refer to such an additional layer as the dynamic layer, and to the bits in it as the pending bits.</p>
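The construction of the fixed layers can be sketched as follows (an illustration of ours; layers are kept as plain strings, and code-words shorter than ℓ − 1 bits are padded with 0s, as stipulated above):

```python
# Build the l-1 fixed layers of the SFDC: layer j collects bit j of every
# code-word of the text, using 0 for code-words shorter than j+1 bits.
codes = {'C': '1100011010', 'o': '1100111', 'm': '101', 'p': '0110101',
         'r': '11101', 'e': '01', 's': '001', 'i': '11010', 'n': '010'}

def fixed_layers(text, codes, l):
    padded = [codes[c].ljust(l - 1, '0') for c in text]   # pad short codes with 0
    return [''.join(p[j] for p in padded) for j in range(l - 1)]

layers = fixed_layers("Compression", codes, 6)
for layer in layers:
    print(layer)
```

Each layer has exactly one bit per text position, so bit i of every layer can be reached directly, without knowing the lengths of the preceding code-words.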
      <p>Pending bits are arranged within the dynamic layer proceeding from left to right and following a last-in first-out (LIFO) scheme. Specifically, such a scheme obeys the following three rules: (a) If the encoding φ(T[i]) has more than one pending bit, then each bit φ(T[i])[j + 1] is stored on the dynamic layer D̂ at some position to the right of that at which the bit φ(T[i])[j] is stored, for j = ℓ − 1, ℓ, . . . , |φ(T[i])| − 2. (b) For 0 ⩽ i &lt; j &lt; n, each pending bit of φ(T[j]) (if any) is stored in the dynamic layer either at a position p such that i ⩽ p &lt; j or to the right of all positions at which the pending bits (if any) of φ(T[i]) have been stored. (c) Every pending bit is stored in the leftmost available position in the dynamic layer, complying with rules (a) and (b).</p>
      <p>Fig. 2 shows the layout of the pending bits, arranged according to the LIFO scheme just illustrated. The fixed layers have a white background, while the dynamic layer has a grey one. In the middle, a relaxed 2-dimensional representation of the pending bits (i.e., the bits past position ℓ − 2) in the dynamic layer is shown. On the right, the pending bits are arranged linearly along the dynamic layer, according to a last-in first-out approach. Idle bits are indicated with the symbol -. In the following sections, we will present the encoding and decoding procedures of the SFDC compression scheme.</p>
      <p>Fig. 2 (left): the character codes used in the running example, with their lengths: s ↦ 001 (3), e ↦ 01 (2), n ↦ 010 (3), p ↦ 0110101 (7), m ↦ 101 (3), C ↦ 1100011010 (10), o ↦ 1100111 (7), i ↦ 11010 (5), r ↦ 11101 (5); (right): the fixed layers T̂_0, T̂_1, T̂_2, T̂_3, T̂_4 and the dynamic layer of the SFDC representation of the string Compression.</p>
      <sec id="sec-2-1">
        <title>2.1. The Encoding Procedure</title>
        <p>The encoding procedure (see Figure 3, on the left) takes as input a text T of length n, the mapping function φ generated by the Huffman compression algorithm, and the size ℓ of the SFDC representation. We recall that the mapping function φ associates each character c ∈ Σ with a variable-length binary code, and that the set of codes {φ(c) | c ∈ Σ} is prefix-free.</p>
        <p>While iterating over all n characters of the text, the encoding procedure uses a stack S to implement the LIFO strategy for the arrangement of pending bits within the dynamic layer D̂. At the i-th iteration, the algorithm takes care of inserting the first bits of the encoding φ(T[i]) in the fixed layers, up to a maximum of ℓ − 1 bits. If |φ(T[i])| &lt; ℓ − 1, exactly ℓ − |φ(T[i])| − 1 bits in the fixed layers will remain idle. Conversely, when the length of the encoding of the character T[i] exceeds the value ℓ − 1, the remaining (pending) bits are pushed in reverse order into the stack S. At the end of the i-th iteration, the first bit extracted from the stack is stored in the dynamic layer. Should the stack be empty at the i-th iteration, the i-th bit of the dynamic layer will remain idle. At the end of the procedure, all the pending bits still in the stack, if any, are placed at the end of the dynamic layer.</p>
        <p>Assuming that each of the layers of length n used in the representation can be initialised at 0 in constant time (or at most in ⌈n/w⌉ steps), the execution time of the encoding algorithm for a text T of length n is trivially equal to the sum of the encoding lengths of each individual character. Thus the encoding of a text of length n is performed in O(m) time, where m is the length of the encoded text. No additional time is required by the algorithm.</p>
        <p>Concerning the number of bits used by the SFDC representation, it can be proved that our compression scheme uses, overall, a number of bits equal to m + O(n · (ℓ − (F_{σ+3} − 3)/F_{σ+1})) = m + O(n), where F_j is the j-th Fibonacci number.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Decoding Procedure</title>
        <p>Although character decoding from a text with an SFDC representation takes place via direct access to the position where the encoding begins, it does not necessarily take constant time. In fact, during the decoding of a character T[i], it may be necessary to address some additional work related to a certain number of additional characters that have to be decoded in order to complete the full decoding of T[i]. In particular, if the last bit of the encoding of T[i] is placed at position j in the dynamic layer, where j ⩾ i, then the procedure is forced to decode all the characters from position i to position j of the text, T[j] included. We refer to such additional work, namely the j − i additional characters that must be decoded in order to complete the decoding of character T[i], as decoding delay. When no character other than T[i] needs to be decoded, we have j = i and the decoding delay is 0.</p>
        <p>Despite the presence of decoding delays, when the whole text, or just a window of it, needs to be decoded, the additional work is not lost: it may indeed happen that a character at position j is decoded before a character at position i, with i &lt; j. This happens when the pending bits of φ(T[i]) are stored after position j.</p>
        <p>We assume that the decoding procedure is invoked to decode the window T[b .. e] of the text, with b ⩽ e. The procedure (see Figure 3, on the right) takes as input the SFDC representation of the text T of length n, the starting position b and the ending position e of the window to be decoded, the root of the Huffman tree, and the size ℓ of the SFDC representation. Note that in order to decode the single character at position i, the procedure needs to be invoked with b = e = i. Likewise, to decode the entire text, the procedure must be invoked with b = 0 and e = n − 1.</p>
        <p>Decoding of a character T[i] is done by scanning a descending path in the Huffman tree, from the root to a leaf. Specifically, while scanning the bits of the encoding φ(T[i]), we descend into the left or right subtree, according to whether we find a bit equal to 0 or to 1, respectively. We assume that each node v in the tree has a Boolean value associated with it, v.leaf, which is set to True when the node v is a leaf, False otherwise. We further assume that v.left (resp., v.right) allows one to access the left (resp., right) child of v. If the node v is a leaf, then it contains a parameter, v.symbol, that allows one to access the character associated with the encoding.</p>
        <p>Again, we maintain a stack to extract the pending bits in the dynamic layer, storing the nodes in the Huffman tree associated with the characters in the text for which decoding has started but is not yet complete. Within the stack, the node v used for decoding the character T[i] is coupled with the position i itself, so that the symbol can be positioned in constant time, once decoded. Thus, the elements of the stack are of the form ⟨v, i⟩. In our description we denote by v_i the node within the Huffman tree used for decoding the character T[i]. Since we use a LIFO strategy for bit arrangements in the dynamic layer, when two nodes v_i and v_j, with 1 ⩽ i &lt; j ⩽ n, are contained in the stack, the node v_j is closer than v_i to the top of the stack.</p>
        <p>Each iteration of the procedure explores vertically the i-th bit of each layer, proceeding from the first layer T̂_0 towards the dynamic layer D̂, in an attempt to decode the i-th character of the text, for i ⩾ b. The iterations stop when the entire window T[b .. e] has been fully decoded. This condition occurs when the character T[e] has been decoded (i.e., W[e] ≠ Nil, where W is the output buffer) and the stack is empty, so that all characters preceding position e in the window have already been decoded.</p>
        <p>Each iteration is divided into two parts: the first part, which is executed only when i &lt; n, explores the Huffman tree, starting from the root and proceeding towards the leaves, in the hope of directly decoding the i-th character (this happens when |φ(T[i])| &lt; ℓ). If a leaf is reached at this stage, the decoded symbol is transcribed; otherwise, the node v is kept on hold and pushed onto the stack. The second part of the loop, if necessary (i.e., when S ≠ ∅), scans the current bit of the dynamic layer D̂ and updates the node v at the top of the stack accordingly. If the node v reaches the position of a leaf, the corresponding symbol is transcribed to position W[j], and the node is deleted from the stack.</p>
        <p>Figure 3: the encoding procedure (left) and the decoding procedure (right) of the SFDC scheme:
readbit(B, i):
  return (B[⌊i/w⌋] ≫ (i mod w)) &amp; 1
writebit(B, i, b):
  B[⌊i/w⌋] ← B[⌊i/w⌋] | (b ≪ (i mod w))
encode(T, n, φ, ℓ):
  S ← new Stack
  i ← 0
  while i &lt; n do
    h ← 0
    while h &lt; min(|φ(T[i])|, ℓ − 1) do
      writebit(T̂_h, i, φ(T[i])[h])
      h ← h + 1
    for j ← |φ(T[i])| − 1 downto h do
      S.push(φ(T[i])[j])
    if S ≠ ∅ then
      writebit(D̂, i, S.pop())
    i ← i + 1
  while S ≠ ∅ do
    writebit(D̂, i, S.pop())
    i ← i + 1
  return ⟨T̂_0, . . . , T̂_{ℓ−2}, D̂⟩
decode(T̂, D̂, n, b, e, root, ℓ):
  S ← new Stack
  W[e] ← Nil
  i ← b
  while W[e] = Nil or S ≠ ∅ do
    if i &lt; n then
      v ← root
      h ← 0
      while h &lt; ℓ − 1 and not v.leaf do
        if readbit(T̂_h, i) then v ← v.right
        else v ← v.left
        h ← h + 1
      if v.leaf then W[i] ← v.symbol
      else S.push(⟨v, i⟩)
    if S ≠ ∅ then
      ⟨v, j⟩ ← S.pop()
      if readbit(D̂, i) then v ← v.right
      else v ← v.left
      if v.leaf then W[j] ← v.symbol
      else S.push(⟨v, j⟩)
    i ← i + 1
  return W[b..e]</p>
        <p>The time complexity required to decode a single character T[i] is O(|φ(T[i])| + δ), while decoding a text window T[b..e] requires O(∑_{i=b}^{e} |φ(T[i])| + δ), where δ is the decoding delay.</p>
        <p>The decoding delay is a key feature for the use of SFDC in practical applications, since the expected access time to text characters is connected to it. This value depends on the distribution of the characters within the text and on the size ℓ of the representation. Ideally, the SFDC representation should use the minimum number of layers that ensures that the expected value of the decoding delay falls below a certain user-defined bound.</p>
        <p>Since in general neither the frequency of the characters within the text nor their distribution is known in advance, it is not possible to define a priori the number of layers that is most suitable for the application at hand. However, one can compute the expected decoding delay by simulating the construction of the SFDC representation of the text for a given number of layers. Should this value be above the bound, one can repeat the computation with a higher number of layers, until a value of ℓ is reached that guarantees an acceptable delay. Such a computation can be done by means of a procedure (see Figure 4) that simulates the construction of the SFDC representation of the text T, without actually producing it.</p>
        <p>Again, assuming that the ℓ layers of the SFDC can be initialised at 0 in constant time, the complexity of such a procedure is O(m), where we recall that m = ∑_{i=0}^{n−1} |φ(T[i])|.</p>
        <p>Concerning complexity issues, it can be shown that, in the worst case, the summation ∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ), where f̄(c) denotes the relative frequency of each character c ∈ Σ, provides an estimate of the expected delay of our SFDC encoding with ℓ layers (where 4 ⩽ ℓ ⩽ σ − 1). Additionally, it can be proved that ∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ) = (F_{σ−ℓ+3} − 3)/F_{σ+1}, where F_j is the j-th Fibonacci number. As a consequence, the SFDC encoding allows accessing each character of the text in time proportional to the length of its encoding, plus an expected O((F_{σ−ℓ+3} − 3)/F_{σ+1}) time overhead, which is less than 1.0 when ℓ ⩾ 4.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Complexity Issues</title>
      <p>In this section, we present a theoretical analysis of the performance of our SFDC encoding, in
terms of expected values of decoding delay and average number of idle bits. The first value is
related to the average access time, while the second one gives us an estimate of the additional
space used by the encoding with respect to the minimum number of bits required to compress
the sequence. In our analysis, we make some simplifying assumptions that do not significantly
impact the overall results. When necessary, in order to support the reasonableness of our
arguments, we will compare theoretical results with experimentally obtained data, reproducing
the same conditions as in the theoretical analysis.</p>
      <p>Before entering into the details of our analysis, we state two useful identities related to Fibonacci numbers, which hold for every k ⩾ 0 and can be proved by induction on k:
∑_{i=0}^{k} F_i = F_{k+2} − 1,   (1)
∑_{i=0}^{k} i · F_i = k · F_{k+2} − F_{k+3} + 2.   (2)
For convenience, we define a slight modification of the Fibonacci sequence, by putting F̄_i := 1 if i = 0, and F̄_i := F_i otherwise.</p>
      <p>Hence, by (1), we have ∑_{i=0}^{k} F̄_i = F_{k+2}, for every k ∈ N.</p>
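These identities are easy to check numerically (a small verification script of ours):

```python
# Numerical check of the Fibonacci identities (1) and (2), and of the sum of
# the modified sequence (F-bar_0 = 1, F-bar_i = F_i otherwise).
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def fbar(i):
    return 1 if i == 0 else fib(i)

for k in range(20):
    assert sum(fib(i) for i in range(k + 1)) == fib(k + 2) - 1                       # (1)
    assert sum(i * fib(i) for i in range(k + 1)) == k * fib(k + 2) - fib(k + 3) + 2  # (2)
    assert sum(fbar(i) for i in range(k + 1)) == fib(k + 2)                          # modified sum
print("identities verified")
```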
      <p>Let Σ := {0, 1, . . . , σ − 1} be an alphabet of size σ, and assume that the characters in Σ are ordered in non-decreasing order with respect to their frequency in a given text T (of length n), namely, f(i) ⩽ f(i + 1) holds for 0 ⩽ i &lt; σ − 1, where f : Σ → N⁺ is the frequency function.</p>
      <p>In order to keep the number of bits used by our encoding scheme as low as possible, it is useful to choose the number ℓ of layers as close as possible to the expected value μ := (1/n) · ∑_{i=0}^{n−1} |φ(T[i])| of the length of the Huffman encodings of the characters in the text T. In particular, if we set ℓ = ⌈μ⌉, it is guaranteed that the number of idle bits, equal to (nℓ − m), is the minimum possible. As already observed, should the decoding delay prove to be too high in practical applications, the value of ℓ can be increased.</p>
      <p>In terms of decoding delay, the optimal case for the application of the SFDC method is when all the characters of the alphabet have the same frequency. In this circumstance, in fact, the character Huffman encodings have length between ⌊log σ⌋ and ⌈log σ⌉.³ Hence, in this case it suffices to use a number of layers equal to ℓ = ⌈log σ⌉ to obtain guaranteed direct access to each character of the text, hence a decoding delay equal to 0.</p>
      <p>On the other side of the spectrum, the worst case occurs when the Huffman tree, in its canonical configuration, is completely unbalanced. In this case, in fact, the difference between the expected value μ and the maximum length of the Huffman encodings, namely (max_{c∈Σ} |φ(c)|) − μ, is as large as possible, thus causing the maximum possible increase of the average decoding delay. In addition, we observe that the presence of several characters in the text whose encoding length is less than ℓ does not positively affect the expected value of the delay, since pending bits are placed exclusively in the dynamic layer. Rather, the presence of such short-code characters increases the number of idle bits of the encoding.
³More precisely, there are 2σ − 2^{⌊log σ⌋+1} characters with encodings of length ⌈log σ⌉ and 2^{⌊log σ⌋+1} − σ characters with encodings of length ⌊log σ⌋.</p>
      <p>We call such a completely unbalanced tree a degenerate tree. Among all the possible texts whose character frequencies induce a degenerate Huffman coding tree, the ones that represent the worst case for our encoding scheme are those in which the frequency differences f(i) − f(i − 1), for i = 1, 2, . . . , σ − 1, are minimal. It is not hard to verify that such a condition occurs when the frequencies follow a Fibonacci-like distribution of the following form:
f(i) := F̄_i, for i = 0, 1, . . . , σ − 1.   (3)</p>
      <p>Thus, let us assume T to be a random text over Σ with the frequency function (3), and let f̄ : Σ → [0, 1] be the relative frequency function of the text T, where f̄(i) = f(i)/F_{σ+1}, for i = 0, 1, . . . , σ − 1. Assume also that φ : Σ → {0, 1}⁺ is the canonical degenerate Huffman encoding of Σ relative to the frequency function (3), with |φ(i)| = σ − 1 if i = 0, and |φ(i)| = σ − i if 1 ⩽ i ⩽ σ − 1.</p>
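That the Fibonacci-like frequencies (3) do force this degenerate shape can be checked by comparing the optimal (Huffman) cost with the cost of the degenerate code lengths above, since all optimal prefix codes for the same frequencies have the same total cost (a small check of ours):

```python
import heapq

# Under the Fibonacci-like frequencies (3), the optimal Huffman cost equals the
# cost of the degenerate code with |phi(0)| = sigma-1 and |phi(i)| = sigma-i,
# i.e. the Huffman tree is completely unbalanced.
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def huffman_cost(freqs):
    # total weighted code length = sum of all internal-node weights
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        x, y = heapq.heappop(heap), heapq.heappop(heap)
        cost += x + y
        heapq.heappush(heap, x + y)
    return cost

for sigma in range(3, 15):
    freqs = [1] + [fib(i) for i in range(1, sigma)]   # F-bar_0, ..., F-bar_{sigma-1}
    degenerate = (sigma - 1) + sum(fib(i) * (sigma - i) for i in range(1, sigma))
    assert huffman_cost(freqs) == degenerate
print("degenerate cost verified")
```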
      <p>Lemma 1. The expected encoding length E[|φ(T[i])|] by φ of the characters in T is equal to (F_{σ+3} − 3)/F_{σ+1}.</p>
      <p>Proof. Using (1) and (2), we have:
E[|φ(T[i])|] = ∑_{i=0}^{σ−1} f̄(i) · |φ(i)| = (1/F_{σ+1}) · (∑_{i=1}^{σ−1} F_i · (σ − i) + σ − 1) = (1/F_{σ+1}) · (F_{σ+3} − σ − 2 + σ − 1) = (F_{σ+3} − 3)/F_{σ+1}.</p>
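The statement of Lemma 1 can be verified exactly with rational arithmetic (our own check, using the degenerate code lengths and relative frequencies defined above):

```python
from fractions import Fraction

# Check of Lemma 1: under the relative Fibonacci frequencies, the expected
# code length equals (F_{sigma+3} - 3) / F_{sigma+1}, exactly.
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def expected_length(sigma):
    fbar = [1] + [fib(i) for i in range(1, sigma)]
    length = [sigma - 1] + [sigma - i for i in range(1, sigma)]
    total = fib(sigma + 1)  # sum of the modified Fibonacci frequencies
    return sum(Fraction(f, total) * L for f, L in zip(fbar, length))

for sigma in range(3, 20):
    assert expected_length(sigma) == Fraction(fib(sigma + 3) - 3, fib(sigma + 1))
print(float(expected_length(30)))
```

For large σ the value approaches the square of the golden ratio, ≈ 2.618, as discussed below.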
      <p>As a consequence of the above, the expected number of idle bits per character in an SFDC encoding using ℓ layers can be estimated by the expression ℓ − (F_{σ+3} − 3)/F_{σ+1}. We evaluated this estimate on a text over an alphabet with Fibonacci frequencies, comparing theoretical values against values experimentally computed. To obtain the experimental values, we used artificially generated texts of 100 MB whose character frequencies follow the Fibonacci distribution. Values are shown for 5 ⩽ ℓ ⩽ 8 and σ ∈ {10, 20, 30}.</p>
      <p>As the size σ of the alphabet increases, it is easy to verify that this expression quickly converges to the value ℓ − 2.618. Indeed, recalling that lim_{i→∞} F_{i+1}/F_i = ϕ, where ϕ = (1 + √5)/2 is the golden ratio, by Lemma 1 we have lim_{σ→∞} E[|φ(T[i])|] = lim_{σ→∞} (F_{σ+3} − 3)/F_{σ+1} = ϕ² ≈ 2.618.</p>
      <p>Consequently, if we assume that the value ℓ represents a constant implementation-related parameter, the total space used by SFDC for encoding a sequence of n characters is equal to m + O(n), where m := ∑_{c=0}^{σ−1} f(c) · |φ(c)|.</p>
      <sec id="sec-3-1">
        <title>3.1. Expected Decoding Delay</title>
        <p>On the basis of what was shown above, we now estimate the expected value of the decoding delay, i.e., the number of additional characters that need to be decoded in order to obtain the full encoding of a character of the text. We claim that it can be estimated by the summation ∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ).</p>
        <p>Assume that the i-th character of the text is T[i] = c, where 0 ⩽ i &lt; n and 0 ⩽ c &lt; σ. We want to estimate the position j ⩾ i at which the last pending bit of the encoding φ(T[i]) is placed, so that the delay of character T[i] is j − i.</p>
        <p>Let us assume that the number of layers of the representation is equal to ℓ. If the length of the encoding of T[i] falls within the number ℓ of layers, namely |φ(T[i])| ⩽ ℓ, then the character can be decoded in a single cycle and the delay is 0. On the other hand, if |φ(T[i])| &gt; ℓ, then |φ(T[i])| − ℓ bits remain to be read, all arranged within the dynamic layer. Since the length of the encoding of any text character can be approximated in the average case by the value 2.618, if we assume to use a fixed number of layers ℓ ⩾ 4, we expect that on average the pending bits of φ(T[i]) will be placed in the |φ(T[i])| − ℓ positions past position i. The average delay in this case can be estimated by |φ(T[i])| − ℓ.</p>
        <p>Thus, for an alphabet of σ characters with frequency function (3), the expected delay of our Fibonacci encoding with ℓ layers (where 4 ⩽ ℓ ⩽ σ − 1) can be estimated by the summation
∑_{c=0}^{σ−1} f̄(c) · (|φ(c)| − ℓ) = (1/F_{σ+1}) · (∑_{c=1}^{σ−ℓ−1} F_c · (σ − c − ℓ) + σ − ℓ − 1) = (F_{σ−ℓ+3} − 3)/F_{σ+1}.   (4)
We compared the average delay in decoding a single character in a text over an alphabet with Fibonacci frequencies using SFDC, tabulating the theoretical values resulting from the identity (4) against experimentally computed values, for 5 ⩽ ℓ ⩽ 8 and σ ∈ {10, 15, 20, 25, 30}. To obtain the experimental values, we generated texts of 100 MB whose character frequency is the Fibonacci frequency (3). For each such text T, the average delay value was experimentally obtained by accessing all the characters in random order, computing the delays, and dividing the sum by the total number of characters in T.</p>
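The closed form in (4) can be checked exactly (our own verification; following the derivation, only the positive terms |φ(c)| − ℓ of the summation contribute):

```python
# Check of identity (4): under the Fibonacci frequencies, the (unnormalised)
# expected decoding delay equals F_{sigma-l+3} - 3; dividing by F_{sigma+1}
# gives the expected delay per character.
def fib(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def expected_delay_numerator(sigma, l):
    fbar = [1] + [fib(i) for i in range(1, sigma)]
    length = [sigma - 1] + [sigma - i for i in range(1, sigma)]
    return sum(f * max(L - l, 0) for f, L in zip(fbar, length))

for sigma in range(10, 31, 5):
    for l in range(4, 9):
        assert expected_delay_numerator(sigma, l) == fib(sigma - l + 3) - 3
print("identity (4) verified")
```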
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future works</title>
      <p>In this paper, we have introduced a data reorganisation technique that, when applied to variable-length codes, allows easy, direct, and fast access to any character of the encoded text, in time proportional to the length of its code-word, plus an additional overhead that is constant in many practical cases. Besides being extremely simple to translate into a computer program and efficient in terms of space and time, our method has the surprising feature of being also computation-friendly. We expect, in fact, that it will turn out to be particularly suitable for applications in text processing. Our future work will be devoted to demonstrating such a claim.</p>
    </sec>
  </body>
  <back>
  </back>
</article>