1. Introduction

G. Romana)

On the Number of Equal-Letter Runs of the Bijective Burrows-Wheeler Transform

Elena Biagi

Davide Cenzato

Zsuzsanna Lipták

Giuseppe Romana

2 0 Ca' Foscari University of Venice , Italy 1 University of Helsinki , Finland 2 University of Palermo , Italy 3 University of Verona , Italy

2023

000 2022 0 0002

The Bijective Burrows-Wheeler Transform (BBWT) is a variant of the famous BWT [Burrows and Wheeler, 1994]. The BBWT was introduced by Gil and Scott in 2012, and is based on the extended BWT of Mantaci et al. [TCS 2007] and on the Lyndon factorization of the input string. In the original paper, the compression achieved with the BBWT was shown to be competitive with that of the BWT, and it has been gaining interest in recent years. In this work, we present the first study of the number of runs of the BBWT, which is a measure of its compression power. We exhibit an infinite family of strings on which of the string and of its reverse difer by a multiplicative factor of Θ(log ), where is the length of the string. We also present experimental results and statistics on () and (rev), as well as on the number of Lyndon factors in the Lyndon factorization of and rev.

eol>Bijective Burrows-Wheeler Transform BWT eBWT ombinatorics on words Lyndon factorization data compression

1. Introduction The Burrows-Wheeler-Transform (BWT) [1] is a fundamental invertible string transform orig

inally introduced in 1994 by Michael Burrows and David J. Wheeler as a preprocessing step for string compression. The BWT tends to be easier to compress than the original input and supports eficient pattern matching tasks while keeping the transform in compressed form. Due to this, this transform has become the cornerstone of several string compressors and compressed † Funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. data structures [ 2, 3 ] and several commonly used bioinformatics tools such as Bowtie [ 4, 5 ] and BWA [ 6 ].

These properties are due to the so-called clustering efect of the BWT. In fact, if the input text is highly repetitive, the final transform will tend to include few long runs of the same character, the number of which is usually denoted . This motivated the interest for as a parameter to measure text repetitiveness, with several recent papers studying the suitability of as a repetitiveness measure as well as how is related to other such measures [ 7, 8, 9 ]. Since a string and its reverse are repetitive in the same way, we would expect the number of runs of their transformations to be the same. However, Giuliani et al. [ 10 ] showed that the parameter is not invariant w.r.t. string reversal; in fact, there are strings for which () and (rev) difer by a multiplicative factor of Θ(log ), where is the length of the string.

The Bijective Burrows-Wheeler Transform (BBWT) [11, 12] is another invertible transforma

tion, which is a variation of the BWT. It is defined as the extended BWT (eBWT) [ 13 ] of the factors of the Lyndon factorization of the input string. As opposed to the BWT, the BBWT is bijective: no two diferent words have the same BBWT, and every word is the BBWT of some word.

This transformation has met increasing interest in the last decade, with several papers published on this topic. Recently Bannai et al. in [ 14 ] presented the first linear time algorithm for constructing the Bijective BWT, thus unlocking the eficient computation of this transform for large inputs. In [ 15 ], it was further shown how to use in-place algorithms for constructing the BBWT and converting the BWT to the BBWT in quadratic time, thus highlighting a strong connection between these two transforms. Finally, in [ 16 ], Bannai et al. presented the first self-index based on the BBWT supporting eficient pattern-matching queries on the original input, similar to the original BWT. This comes with a small additional log(| |) factor in the backward search algorithm, where is the pattern.

In the original BBWT paper [ 11 ], the authors studied the suitability of this transform for text compression. Their experimental results show that on the Calgary corpus, the compression guaranteed by the BBWT is competitive. In particular, on average, the BBWT was about 1% more compressible than the BWT. The BBWT properties were further studied in subsequent papers, and it has inspired the definition of other bijective BWT variants [ 12, 17, 18 ]. However, the efectiveness of the number of runs of the BBWT as a repetitiveness measure has not been studied before.

In this paper, we present the first results on as a repetitiveness measure, comparing the

behaviour of of a string and of its reverse. We define an infinite family of words for which of the string and its reverse difer by a multiplicative factor of Θ(log ), where is the length of the string, thus proving a parallel result on the BBWT to that of Giuliani et al. [ 10 ] on the

BWT. This result shows the BBWT, as a measure of repetitiveness, exhibits the same defect as

the BWT, namely that reversing the string may change it significantly, while repetitiveness is, of course, preserved. The family of strings used in this paper derive from finite Fibonacci words, as do those of [ 10 ], but the similarities end there; both the strings themselves and the proof techniques employed are diferent.

In the final part of the paper, we present experimental results on , studying the multiplica

tive and additive diference between a string and its reverse, as well as the relationship to the number of factors of its Lyndon factorization.

Due to space constraints, some proofs have been omitted and will be included in the full version of this paper. A preliminary version of some of the results in this paper was contained in the first author’s master thesis [19]. 2. Basics Let Σ be a finite ordered alphabet of size . A string (or word) over Σ is a finite sequence

= 1 · · · of characters from Σ. We denote the length of string as ||. The empty string is the only string of length 0 and is denoted . For ≥ 0, Σ denotes the set of all words of length , and Σ* = ∪≥ 0Σ the set of all finite words over Σ.

Let , , , ∈ Σ* such that = . Then is called a prefix , a substring, and a sufix of . A string is a subsequence of if can be obtained from by deleting some (possibly 0, possibly all) characters from . A prefix (sufix, substring) of a word is a called proper if ̸= . For a string and an integer ≥ 1, = · · · · denotes the -fold concatenation of . A string is called primitive if = implies = and = 1. If is not primitive then it is called a power. Every word can be uniquely written as = for a primitive string , called root of . Given a string , we also define the infinite word = · · · · · We denote the number of maximal equal-letter runs of a string by runs().

The lexicographic order on Σ* is defined as follows: <lex if either is a proper prefix of ,

or if there exists ∈ Σ* and , ∈ Σ, < such that is a prefix of and is a prefix of . Another order relation on Σ* is the omega-order defined as follows: Let = and = ℓ, , primitive. Then < if = and < ℓ, or else if <lex . Note that the lexicographic and the omega-order coincide on strings of the same length, but they can difer if one is a proper prefix of the other, e.g. ab <lex aba but aba < ab.

Two words and ′ are called conjugates (or rotations) if there exists , , possibly empty,

s.t. = and ′ = . Conjugacy is an equivalence relation. We denote the conjugacy class of a word ∈ Σ, as [] = { | and are conjugates}. A word is primitive if and only if its conjugacy class has cardinality ||.

A primitive word is called a Lyndon word if it is lexicographically strictly smaller than all

of its rotations. A necklace is a Lyndon word or a power of a Lyndon word. For a primitive word , we denote by () the unique conjugate which is a Lyndon word (its Lyndon rotation). Every string can be uniquely written as = 1 · 2 · · · such that are Lyndon words for = 1, . . . , , and 1 ≥ 2 ≥ . . . ≥ [20]. This is called ’s Lyndon factorization. A string is Lyndon if and only if its Lyndon factorization consists of one factor only. The multiset of Lyndon factors in the Lyndon factorization of is denoted ℒ().

The Burrows-Wheeler Transform (BWT) [1] of a string is a permutation of the characters of

, whose th character is the last character of the th rotation of , where the rotations are taken w.r.t. lexicographic order. For example, BWT(banana) = nnbaaa, see Fig. 1. The number of runs of the BWT is denoted () = runs(BWT()), e.g. (banana) = 3.

The extended BWT (eBWT) [13] is a generalization of the BWT to a multiset of primitive strings

ℳ: eBWT(ℳ) is a permutation of the characters of the strings in ℳ, whose th character is the last character of the th rotation, where the rotations of all strings in ℳ are listed w.r.t. omega-order. For an example, see Fig. 2. Note that for all strings , eBWT({}) = BWT().

Next we list some known properties of the eBWT.

Lemma 1

1. Let ∈ . Then BWT() is a subsequence of eBWT(). 2. Let be a multiset of primitive strings, and ′ a conjugate of some ∈ . Then the number of runs of eBWT( ∪ {′}) equals the number of runs of eBWT(). 3. Given an integer > 0 and a primitive word , let be the multiset consisting of copies of . Then BWT() = eBWT().

3. The bijective BWT

Let = 1 · 2 · · · be the Lyndon factorization of string . Then BBWT() = eBWT(ℳ), where ℳ = ℒ() = {1, 2, . . . , } is the multiset of Lyndon factors of . As an example, BWT(banana) = nnbaaa, while BBWT(banana) = annbaa, since the Lyndon factorization of banana is b · an · an · a, and thus ℒ(banana) = {a, an, an, b}, see Fig. 1. sorted rotations BWT abanan n anaban n ananab b banana a nabana a nanaba a

Proof First assume that BWT() = BBWT() holds. Let = , where is primitive and ≥ 1, and let be the conjugate of which is a necklace. Then clearly, = with = (), and ℒ() consists of copies of . Thus:

BBWT() d=ef. eBWT(ℒ()) Lem=ma 1 BWT() , conj. BWT() = BBWT().

= By bijectivity of the BBWT, this implies that equals its own necklace rotation , and is thus a necklace.

Conversely, if = is a necklace, then ℒ() consists of copies of , and BBWT() = eBWT(ℒ()) = BWT(), again by Lemma 1. □

Let us denote by () = runs(BBWT()), the number of runs of the BBWT of . It is known that the number of runs () of the BWT of a binary string is always even. This is because necessarily the first character must be b, and the last character must be a. This is not the case of the BBWT since a and b are the smallest and the greatest Lyndon factor, respectively, that a binary word can have, and therefore, BBWT may start and end with either a or b. In the next lemma, we give the conditions that a or b appear as a Lyndon factor.

Lemma 3 Let ∈ {a, b}* be a binary string, with a < b. Then, a ∈ ℒ() if and only if a is the last letter of . Symmetrically, b ∈ ℒ() if and only if b is the first letter of . Proof We prove just the first statement on a, and the other case is treated symmetrically. For the first direction, observe that a is the smallest Lyndon factor (both in lexicographical and order) that any binary word can have. Since the Lyndon factorization requires that any Lyndon factor must be greater or equal than the following in (if any), a ∈ ℒ() implies that either the next letter in is another a, or there are no more letters in . For the other direction, suppose by contradiction that ends with an a and a ∈/ ℒ(). Then, there exist ℓ ≥ 0, ≥ 1, ∈ {a, b}* such that the word = aℓba ∈ ℒ(), that is the Lyndon factor containing the last a of . However, one can verify that aℓ+1b < aℓba, which contradicts that is a Lyndon word. □

We next characterize when () is odd.

Lemma 4 Let ∈ {a, b}* be a binary word. It holds that () is odd if and only if starts and ends with the same letter.

However, it turns out that there is a simple connection between () and (rev), namely that they must have the same parity: Lemma 5 For every binary string , the diference between the number of runs of the BBWT of and the BBWT of its reverse is even.

Proof If starts and ends with the same character, then so does rev, and by Lemma 4, both have an odd number of runs. If starts and ends with diferent characters, then so does rev, and by Lemma 4, both have an even number of runs. □

The BWT of a word achieves maximal compression when it has as many runs as the size of the alphabet. In case of binary alphabets, it was shown in [21] that the perfect clustering efect of the BWT is a characterization for the family of standard Sturmian words. These words can be constructed through a directive sequence, an infinite sequence of integers {}≥ 0 such that 0 ≥ 0 and > 0 for all > 0. The standard Sturmian words generated by this sequence are the words of the sequence {}≥ 0 such that 0 = b, 1 = a, and +1 = − 1 − 1 for > 1. For instance, the directive sequence 1, 1, 1, 1, . . . generates the well-known sequence of ifnite Fibonacci words : 0 = b, 1 = a, 2 = ab, 3 = aba, 4 = abaab, 5 = abaababa, 6 = abaababaabaab, 7 = abaababaabaababaababa, . . .

Theorem 1 ([21]) BWT() = baℓ, for some , ℓ ≥ 1, if and only if is the power of a rotation of a standard Sturmian word.

We give a similar characterization for the BBWT. As opposed to the BWT, here it is not

necessary that the transform begins with b and ends with a.

Lemma 6 The number of runs of the BBWT of a string is 2 if and only if (i) has the form baℓ for some , ℓ ≥ 1, or (ii) is the Lyndon rotation of a standard Sturmian word or a power of the Lyndon rotation of a standard Sturmian word. diference of as () = () − (rev).

In the rest of the paper, we will look at the relationship between () and (rev). To this end, we define the runs-ratio of string as () = max( ((re)v) , ((re)v) ), and the runs

4. Fibonacci words In this section, we will show that the Lyndon rotation of a Fibonacci word of order has the

following interesting property: the number of runs of the BBWT of its reverse has 2( − 2) runs, while the BBWT of the word itself has only 2 runs. Thus, of these words is Θ(log ), where is the length of the word.

Fibonacci words were defined in the previous section. It follows directly from the defini

tion that for all ≥ 0, || = , where {}≥ 0 is the well-known Fibonacci sequence

1, 1, 2, 3, 5, 8, 13, . . . Fibonacci words have been studied extensively, see [22] for an overview. Some of the properties of Fibonacci words also follow from properties that have been shown for all standard words, see e.g. [23, 24, 25, 26, 21]. We next list some of these properties:

Proposition 1 (Some known properties of the Fibonacci words) Let be the Fibonacci word of order ≥ 0. Then there exists a palindrome with the following properties: 1. for all ≥ 2: if is even, then = ab, and if is odd, then = ba, where is a palindrome (in particular, 2 = ). 2. for all ≥ 4, 3. for all ≥ 2, ab is a Lyndon word. 4. for all : ()rev is a conjugate of .

• if is even, then = − 1ba− 2ab = − 2ab− 1ab, and • if is odd, then = − 1ab− 2ba = − 2ba− 1ba. 5. for ≥ 2: BWT() has two runs; in particular, BWT() = b− 2 a− 1 . 6. for ≥ 2: ab and ba, the so-called central words, are adjacent in the BW-matrix of .

From these we can easily deduce further properties. Recall that () is the Lyndon rotation of a primitive string .

Corollary 1 Let be the palindrome from Prop. 1, i.e. = ab for even, and = ba for odd. Then 1. () = ab, 2. = − 1ba− 2 = − 2ab− 1, if even, 3. = − 1ab− 2 = − 2ba− 1, if odd.

Since Lyndon rotations of Fibonacci words will be of central importance, we list the first few

here: (0) = b, (1) = a, (2) = ab, (3) = aab, (4) = aabab, (5) = aabaabab.

Next we study the Lyndon factorization of the reverse of the Lyndon rotation of a Fibonacci

word . We will show that it consists of the Lyndon rotations of for = 0, . . . , − 2, where the words of even order are listed increasingly (w.r.t. their order), followed by those of odd order listed decreasingly, followed by one additional factor (1) = a. Formally: Lemma 7 Let = ()rev be the reverse of the Lyndon rotation of the Fibonacci word of order . Then the Lyndon factorization of is as follows: 1. = (0) · (2) · (4) · · · (− 2) · (− 3) · (− 5) · · · (3) · (1) · (1) if is even, and 2. = (0) · (2) · (4) · · · (− 3) · (− 2) · (− 4) · · · (3) · (1) · (1) if is odd.

Example 1 In the following we list with its Lyndon factorization, for = 2, . . . , 8 : • 2 = (2)rev = b · a, • 3 = (3)rev = b · a · a, • 4 = (4)rev = b · ab · a · a, • 5 = (5)rev = b · ab · aab · a · a, • 6 = (6)rev = b · ab · aabab · aab · a · a, • 7 = (7)rev = b · ab · aabab · aabaabab · aab · a · a, • 8 = (8)rev = b · ab · aabab · aabaababaabab · aabaabab · aab · a · a.

For the proof of Lemma 7, we will first need two other lemmas. The first one gives a simple

recursion for : Lemma 8 Let = (())rev. Then the following recursion holds for ≥ 2 : • = − 2− 1 if is even, and • = − 1− 2 if is odd.

Proof First note that since () = ab, and is a palindrome, therefore = ba. Let ≥ 2 be even. By Corollary 1, = ba = b− 2ab− 1a = − 2− 1. Similarly, if is odd, then = ba = − 1ab− 2 = − 1− 2. □

The second lemma gives a factorization of the Lyndon rotations of the .

Lemma 9 Let ≥ 4. Then the following holds: • () = (− 3)(− 5) · · · (3)(1)(1)(0)(2)(4) · · · (− 2), if is even, and • () = (− 2)(− 4) · · · (3)(1)(1)(0)(2)(4) · · · (− 3), if is odd.

Now we are ready to prove Lemma 7.

Proof (of Lemma 7) The proof is by induction on . For the base cases, 2 = b · a, and 3 = b · a · a, as claimed. Now let > 3. If is even, then = − 2− 1 = b− 2ab− 1a. Thus, = (0)(2) · · · (− 4)(− 5) · · · (3)(1)(1) · ⏟ − 2 by⏞the I.H.

(0)(2) · · · (− 4)(− 3) · · · (3)(1)(1) ⏟ − 1 by⏞the I.H. = (0)(2) · · · (− 4)· (− 5) · · · (3)(1)(1)(0)(2) · · · (− 4) · ⏟ (− 2) b⏞y Lemma 9 (− 3) · · · (3)(1)(1) = (0) · (2) · (4) · · · (− 2) · (− 3) · (− 5) · · · (3) · (1) · (1). The claim for odd follows analogously. □ Example 2 For example, 8 = 6 · 7 = babaababaabaa · babaababaabaababaabaa = b · ab · aabab · aabaababaabab · aabaabab · aab · a · a, where the new Lyndon factor is underlined; its factorization from Lemma 9 is (6) = aabaababaabab = aab · a · a · b · ab · aabab. Corollary 2 (from Lemma 7) Let be the reverse of the Lyndon rotation of the Fibonacci word of order . Then ℒ() = {(0), (1), (2), . . . , (− 2)} ∪{(1)}, i.e. the factor (1) = a appears with multiplicity 2 in the factorization, while all other factors appear exactly once.

The final piece we need to prove the main theorem of this section will be Lemma 10, which

gives the number of runs of the eBWT of the set of Fibonacci words = {0, 1, . . . , }. A crucial role will be played by the central words, which we list up to order 7 in Table 1. Lemma 10 Let > 0 and be the set of Fibonacci words of order up to , i.e. = {0, 1, . . . , }. Then eBWT() has 2 runs.

Proof We prove the claim by induction on . For = 1, 2, eBWT({a, b}) = ab, which has 2 runs, and eBWT({a, b, ab}) = abab, which has 4 runs, as claimed.

Now let > 2. We will observe what happens to eBWT(− 1) when we insert into − 1 and will show that exactly 2 new runs are created, see Fig. 2.

Let eBWT(− 1) be given, and consider what happens when we add . Denote the two central words of order as = ab and = ba. Note that = if is even, and = if is odd. Clearly, < for all . Moreover, it can be proven that for > 2, the following relationship holds between central words of subsequent order: > − 1 if is even, and < − 1 if is odd.

Now, when considering eBWT(− 1), the two words and are inserted together, i.e. they are adjacent in the eBWT-matrix of . This is because they have the common prefix , of length || − 2, which, for > 3, is longer than the longest word previously present (of length |− 1| = − 1 < − 2); for = 3 the claim can be seen by direct inspection (see Fig. 2, left).

Moreover, the two central words will be inserted immediately adjacent to one of the two central words of order − 1. This is because one of the two central words is always equal to ( for even, for odd), and thus, − 1 is a proper prefix of or − 1 is a proper prefix of , and no longer prefix of these words can be present. Thus we have = > − 1 = − 1 if is even, and = < − 1 = − 1 if is odd. Therefore, and will be inserted immediately after − 1 = − 1 if is even, and immediately before − 1 = − 1 if is odd. It follows by induction that − 1 was inserted immediately after (immediately before) − 2 if is even (if is odd). This in turn implies that the word just before (just after) and is − 2, for even (for odd).

Now consider the number of runs of eBWT(− 1). By the induction hypothesis, eBWT(− 1) has 2 − 2 runs. Inserting the two central words creates two new runs. This is because they end in b and a, in this order, and they are inserted between the two previous Fibonacci words: if is even, then they are inserted between − 1, which ends in a, and − 2, which ends in b; if is odd, then between − 2, which ends in a, and − 1, which ends in b.

Note that, since is a standard word, BWT() has the form baℓ, and by Lemma 1, this will be a subsequence of eBWT(). This implies that, if the two central words of order are inserted between, say, position and position + 1 of eBWT(− 1), then all rotations of ending in b will be inserted before , and all rotations ending in a will be inserted after + 1. Inserting a b will create a new run only if it is inserted between two a’s; likewise, inserting an a will create a new run only if it is inserted between two b’s. It is not dificult to prove (by induction) that all a-runs before the position of − 1 have length 1, and that all b-runs after the position of − 1 have length 1.

Thus, exactly two new runs are created. This completes the proof. □ Theorem 2 Let = () be the Lyndon rotation of the th Fibonacci word . Then () = Θ(log ||).

Proof First note that since is a Lyndon word, BBWT() = BWT() by Lemma 2. Since it is a rotation of , BWT() = BWT(). By Prop. 1, BWT() has two runs, so BBWT() has two runs, thus, () = 2.

Now let = ()rev. By Corollary 2, the set of Lyndon factors of is − 2 = {0, . . . , − 2}. It follows from Lemma 1 that the number of runs of a multiset depends only on the set of the elements (and not on the multiplicity of each element). By Lemma 10, the number of runs of eBWT(− 2) is 2( − 2), and therefore, () = 2( − 2).

Finally, using the fact that || = || = grows exponentially in , it follows that () = 2(2− 2) = Θ() = Θ(log ||). □

5. Experimental results In our experiments, we studied the parameter of a string and its reverse, looking at both

multiplicative ( ) and additive diference ( ). We also studied the number of Lyndon factors in the Lyndon factorization of . We considered only those strings which are lexicographically strictly smaller than their reverse. We refer to such strings also as forward strings , as opposed to strings which are strictly larger than their reverse: we refer to these as reverse strings. This is to avoid repeating the same experiment twice (as we compare of a string and of its reverse).

Note that this experimental setup excludes palindromes. We computed the BBWT of all forward strings for lengths between 3 and 25, over a binary

alphabet. We also ran the same experiment over a ternary alphabet, for strings of length up to 15 (data not shown). 5.1. Multiplicative diference in

of a string and its reverse In the first step of our analysis, () was calculated for each forward string . We report the maximum for each length over a binary alphabet in Table 2 1. Our results show that 1The statistics reported in the summary do not include palindrome sequences since their () is always 1. The number of palindromes of length is 2⌈/2⌉. increasing the sequence length , the average runs-ratio as the proportion of sequence pairs having () = 1 decreases. On the other hand, large sequence lengths generate large maximum () values. This suggests that by increasing , we observe more sequences for which the () is very close to 1 and only a small subset for which () is very diferent from (rev), i.e., we observe extremal words with larger () values. In addition, for every up to 21, the most frequent value of () is 1, i.e. probability that () = (rev) is high (see Fig. 3 for = 21).

As for the extremal words, we noticed that several of them are Lyndon rotations of standard words (data not shown). In fact, these strings always have () = 2, and thus they tend to generate large (). In particular, for () ≥ 4 and ≤ 23, all extremal cases are Lyndon rotations of standard words. However, no precise pattern is visible since also non-standard words can be extremal words. On the other hand, if we consider the smallest values, which present an increase in (), for () ≥ 3, the extremal words are the Lyndon rotation of the Fibonacci word of length and its reverse complement. For instance, = 8 is the smallest which allows to obtain () = 3, and the two extremal cases are aabaabab and ababbabb. As for () ≤ 2, the situation appears more complex, since there are several extremal cases that reach the same runs-ratio. 5.2. Number of Lyndon factors and

To conclude our experiment, we computed the number of Lyndon factors for all and put that

in relation to (). In Fig. 4 we note that there are no strings with both a high number of

Lyndon factors and a high number of runs as the top right corners of both graphs for forward

(a) and reverse (b) are empty. The two plots are quite similar, with the exception of the number of Lyndon factors that appear to be higher in reverse strings (see also Fig. 3). In addition, the -axis for forward and reverse is the same, indicating that values of span the same range on both forward and reverse strings.

We believe that the behavior described above might be related to the way we define ; since

is always lexicographically strictly smaller than its reverse, if rev has a run of b at the beginning and a run of a at the end the resulting Lyndon factorization will contain several length-one

Lyndon factors. For instance, the Lyndon factorization of bbbbababaababbaaaa results in

eight length-one Lyndon factors: b · b · b · b · ab · ab · aababb · a · a · a · a. However, also in this case, no clear pattern arose from our experiments. In fact, there are cases where the Lyndon factorization of leads to () which is much smaller than (rev).

On strings over a binary alphabet, we can observe in the right plot in Fig. 4 that many strings

are found on the vertical line indicating 0 as the diference in the number of runs, and the two measures analysed do not seem to strongly correlate. Results were inconclusive also when performing the same analysis counting only distinct Lyndon factors. Further details on the experimental results will be given in the full version of the paper. survey of string orderings and their application to the Burrows-Wheeler transform, Theor.

Comput. Sci. 710 (2018) 52–65. [19] E. Biagi, On comparing the Bijective Burrows-Wheeler-Transform of a word and its reverse,

Master’s thesis, University of Verona, Italy, 2022.

[20] K. T. Chen, R. H. Fox, R. Lyndon, Free diferential calculus, iv. the quotient groups of the lower central series, Annals of Mathematics 68 (1958) 81. [21] S. Mantaci, A. Restivo, M. Sciortino, Burrows-Wheeler transform and Sturmian words, Inf.

Process. Lett. 86 (2003) 241–246. [22] A. de Luca, A combinatorial property of the Fibonacci words, Inf. Process. Lett. 12 (1981) 193–195. [23] A. de Luca, F. Mignosi, Some combinatorial properties of Sturmian words, Theor. Comput.

Sci. 136 (1994) 361–285. [24] A. de Luca, Sturmian words: Structure, combinatorics, and their arithmetics, Theor.

Comput. Sci. 183 (1997) 45–82. [25] J. Berstel, A. de Luca, Sturmian words, Lyndon words and trees, Theor. Comp. Sci. 178 (1997) 171–203. [26] J. Borel, C. Reutenauer, On Christofel classes, RAIRO Theor. Inf. Appl. 40 (2006) 15–27.

[1]

Burrows ,

D. J.

Wheeler , A block-sorting lossless data compression algorithm , Technical Report, DIGITAL System Research Center , 1994 .

[2]

Gagie , G. Navarro,

Prezza , Fully functional sufix trees and optimal text searching in BWT-runs bounded space , ACM 67 ( 2020 ) 2: 1 - 2 : 54 .

[3]

Mäkinen , G. Navarro, Succinct sufix arrays based on run-length encoding , Nord. J. Comput . 12 ( 2005 ) 40 - 66 .

[4]

Langmead ,

Trapnell ,

Pop ,

S. L.

Salzberg , Ultrafast and memory-eficient alignment of short DNA sequences to the human genome , Genome Biology 10 ( 2009 ) R25 . doi: 10 . 1186/gb-2009 -10-3-r25.

[5]

Langmead ,

S. L.

Salzberg , Fast gapped-read alignment with Bowtie 2 , Nature

Methods

9 ( 2012 ) 357 - 359 .

[6]

Li ,

Durbin , Fast and accurate long-read alignment with Burrows-Wheeler transform , Bioinform . 26 ( 2010 ) 589 - 595 .

[7]

Akagi ,

Funakoshi ,

Inenaga , Sensitivity of string compressors and repetitiveness measures , Inf. Comput . 291 ( 2023 ) 104999 .

[8]

Giuliani ,

Inenaga , Zs. Lipták, G. Romana,

Sciortino ,

Urbina , Bit catastrophes for the Burrows-Wheeler Transform , in: Proc. of Developments in Language Theory - 27th International Conference, DLT 2023 , volume 13911 of Lecture Notes in Computer Science, LNCS, 2023 , pp. 86 - 99 .

[9]

Navarro , Indexing highly repetitive string collections, part I: repetitiveness measures , ACM Comput. Surv . 54 ( 2022 ) 29 : 1 - 29 : 31 .

[10]

Giuliani ,

Inenaga , Zs. Lipták,

Prezza ,

Sciortino ,

Tofanello , Novel results on the number of runs of the Burrows-Wheeler-Transform , in: Proc. of 47th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2021 , volume 12607 of Lecture Notes in Computer Science, LNCS, 2021 , pp. 249 - 262 .

[11]

J. Y.

Gil ,

D. A.

Scott , A bijective string sorting transform , CoRR abs/1201 .3077 ( 2012 ).

[12]

Kufleitner , On bijective variants of the Burrows-Wheeler Transform , in: Proc. of the Prague Stringology Conference 2009 , 2009 , pp. 65 - 79 .

[13]

Mantaci ,

Restivo , G. Rosone,

Sciortino , An extension of the Burrows-Wheeler Transform , Theor. Comput. Sci . 387 ( 2007 ) 298 - 312 .

[14]

Bannai ,

Kärkkäinen ,

Köppl , M. Piatkowski, Constructing the Bijective and the Extended Burrows-Wheeler Transform in linear time , in: Proc. of 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021 , volume 191 of LIPIcs, 2021 , pp. 7 : 1 - 7 : 16 .

[15]

Köppl ,

Hashimoto ,

Hendrian ,

Shinohara , In-place Bijective Burrows-Wheeler Transforms, in: Proc. of 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020 , volume 161 of LIPIcs, 2020 , pp. 21 : 1 - 21 : 15 .

[16]

Bannai ,

Kärkkäinen ,

Köppl , M. Piatkowski, Indexing the Bijective BWT , in: Proc. of 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019 , volume 128 of LIPIcs, 2019 , pp. 17 : 1 - 17 : 14 .

[17]

J. W.

Daykin ,

W. F.

Smyth , A bijective variant of the Burrows-Wheeler Transform using V-order , Theor. Comput. Sci . 531 ( 2014 ) 77 - 89 .

[18]

J. W.

Daykin ,

Groult ,

Guesnet ,

Lecroq ,

Lefebvre ,

Léonard , É. Prieur-Gaston , A