=Paper=
{{Paper
|id=Vol-3284/4242
|storemode=property
|title=When a Dollar in a Fully Clustered Word Makes a BWT
|pdfUrl=https://ceur-ws.org/Vol-3284/4242.pdf
|volume=Vol-3284
|authors=Sara Giuliani,Zsuzsanna LiptΓ k,Francesco Masillo
|dblpUrl=https://dblp.org/rec/conf/ictcs/GiulianiLM22
}}
==When a Dollar in a Fully Clustered Word Makes a BWT==
When a Dollar in a Fully Clustered Word Makes a BWT Sara Giuliani1,* , Zsuzsanna LiptΓ‘k1 and Francesco Masillo1 1 University of Verona, Italy Abstract The Burrows-Wheeler Transform (BWT) is a powerful transform widely used in string compression and string processing. It produces a reversible permutation of the characters of the input string, often allowing for easier compression of the string, while also enabling fast pattern matching. While the BWT is defined for every word, not every word is the BWT of some word. A characterization of BWT images was given in [Likhomanov and Shur, CSR, 2011], based on the standard permutation of the string. Often an end-of-file character $ is added to mark the end of a string. Given a string π€, it is an interesting combinatorial question where a $ can be inserted to make π€ the BWT of some string π£$. This question was answered in [Giuliani et al. Theor. Comput. Sci. 2021], where an efficient algorithm was presented for computing all such positions (called nice positions), and a characterization of nice positions was given, based on pseudo-cycles in the standard permutation of π€. In this paper, we give a stronger characterization of nice positions: We show that these can be characterized using only essential pseudo-cycles, which constitute a small subset of all possible pseudo- cycles. We present an algorithm to compute all essential pseudo-cycles of a word π€. In the second part of the paper, we study nice positions of fully clustered words: these are words π€ whose number of runs equals the number of distinct characters occurring in π€. Fully clustered words are of particular interest due to their extreme compressibility, and words whose BWT is fully clustered have been studied extensively. We are interested in the number of nice positions of fully clustered words, as well as in the number of fully clustered words with π nice positions, for fixed π and word length. Keywords combinatorics on words, Burrows-Wheeler-Transform, permutations, string algorithms, fully clustered words Proceedings of the 23rd Italian Conference on Theoretical Computer Science, Rome, Italy, September 7-9, 2022 * Corresponding author. " sara.giuliani_01@univr.it (S. Giuliani); zsuzsanna.liptak@univr.it (Zs. LiptΓ‘k); francesco.masillo@univr.it (F. Masillo) Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 1. Introduction The Burrows-Wheeler Transform (BWT) [1] is a widely used reversible transformation on strings, which is central in the context of lossless compression and string processing. It is a permutation of the input word that tends to group together equal characters. Usually, the more repetitive the string, the longer the groups of consecutive equal characters, called runs. This is known as clustering effect [2], and allows for easier compression of the input [3, 4]. Due to its power with very repetitive strings and to its reversibility, the BWT is fundamental in fields where large amounts of textual data need to be processed, such as bioinformatics, where BWT-based tools are among those most frequently used [5, 6, 7, 8]. Understanding the underlying combinatorics is crucial in applications, since they are necessary for advancements on data structures based on the BWT, such as the FM-index [9] or the more recent π-index [10]. The BWT has also been studied intensively from a combinatorics on words point of view, with [11] showing that it was a special case of the Gessel-Reutenauer bijection [12]. Many papers examined the clustering effect of the BWT, most recently [13, 14, 15, 16], while other transforms with similar properties were investigated in [17, 18, 19]. While the BWT is defined for every word, not every word is the BWT of some word; characterizations of BWT images are known [20, 21]. In this paper, we continue the study of the following combinatorial problem, introduced in [22]: Given a word π€, where, if at all, can a sentinel character $ be inserted into π€ to make it the BWT of some word ending with $? We call such positions nice positions. The question is motivated by the fact that in the application context, often such a sentinel character $ is appended to mark the end of the input string. In [22], an πͺ(π log π)-time algorithm for computing all nice positions was given, as well as a characterization using so-called pseudo-cycles. In this paper, we give a stronger characterization of nice positions, restricting the set of pseudo-cycles to a subset which can be exponentially smaller than the complete set. We also present an πͺ(π2 )-time algorithm that computes nice positions via this restricted set of pseudo-cycles. Although the new algorithm is slower than the previous one in the worst case, it gives more information in output, thus opening up new possibilities of insights into the problem. We then apply this algorithm to the study of nice positions in fully clustered words. These are words whose number of equal-letter-runs equals the number of distinct characters, such as π€ = bbbbcaa. Words whose BWT is fully clustered are of special interest because of their high compressibility via the BWT; e.g. such a word is abbacbb, with BWT π€. Over a binary alphabet, it is known that a word has fully clustered BWT if and only if it is conjugate of a standard word or of a power of a standard word [20]. Over a ternary alphabet, there exists a characterization via morphisms, of words whose BWT has the form cπ bπ aπ (there called Type I) [23], while a combinatorial property (circular palindromic richness) was shown to be a necessary but not sufficient condition for such words [24, 25]. This result extends to larger alphabets, i.e. ππβ1 words whose BWT has the form ππππ ππβ1 . . . ππ1 1 , with alphabet Ξ£ = {π1 , π2 , . . . , ππ } and π1 < π2 < . . . < ππ . Finally, a characterization of words with fully clustered BWT, over arbitrary alphabets, was given in terms of discrete interval exchanges in [26]. We present both theoretical and experimental results on nice positions in fully clustered words, as well as on the number of words with π nice positions, for fixed π. We studied binary and ternary alphabets. Due to space restrictions, we present only the results on the binary alphabet, as well as having to omit some proofs. Both will be included in the full version. Overview of paper: In Section 2, we introduce the necessary notation and basic facts. In Section 3, we give the new stronger characterization of nice positions, and in Section 3.1, we describe the algorithm exploiting the new characterization. In Section 4, we present some properties and conjectures related to nice positions for fully clustered words for the unary and binary alphabets. We close with a brief outline of future work in Section 5. 2. Preliminaries Strings and permutations. Given a finite ordered alphabet Ξ£, a string (or word) over Ξ£ is a finite sequence π€ = π€0 Β· Β· Β· π€πβ1 of characters from Ξ£. We denote by π the size of Ξ£, by |π€| = π the length of a word π€0 . . . π€πβ1 , and by π the empty word, the only string of length 0. Note that we index strings from 0. We denote by alph(π€) = {π | there exists 1 β€ π β€ |π€| : π€π = π} the set of characters occurring in π€. For two words π’, π£, π’π£ denotes their concatenation, given by π’π£ = π’0 Β· Β· Β· π’|π’|β1 π£0 Β· Β· Β· π£|π£|β1 . If π€ = π₯π£π¦, then π₯ is called a prefix, π£ a substring, and π¦ a suffix of π€. Given a word π’ and a positive integer π, π’π = π’ Β· Β· Β· π’ denotes the π-fold concatenation of π’. A word π£ is called a power if it can be written as π£ = π’π for some π > 1, otherwise it is called primitive. Often a sentinel character, denoted $, is appended to mark the end of a string, where $ does not occur elsewhere in the string, and is assumed to be smaller than all π β Ξ£. Clearly, every word of the form π£$ is primitive. Two words π€, π€β² are called conjugates (or rotations) if there exist π’, π£, possibly empty, such that π€ = π’π£ and π€β² = π£π’. Given a word π€ = π€0 Β· Β· Β· π€πβ1 , the πβth rotation of π€, for 0 β€ π β€ πβ1, is π€π Β· Β· Β· π€πβ1 π€0 Β· Β· Β· π€πβ1 . A word of length π is primitive if and only if it has π distinct conjugates. The set of all words over Ξ£ is totally ordered by the lexicographic order: Let π£, π€ β Ξ£* , then π£ β€ π€ if π£ is a prefix of π€, or there exists an index π s.t. for all π < π, π£π = π€π , and π£π < π€π according to the order on Ξ£. A run in a word π€ is a maximal substring consisting of the same character. A fully clustered ππβ1 word is a word π€ with |alph(π€)| runs. Given a word π€ = ππ00 ππ11 Β· Β· Β· ππβ1 with π runs, we define its pattern as pat(π€) = π0 π1 Β· Β· Β· ππβ1 . For example, the fully clustered word π€ = bbccccca has pattern pat(π€) = bca. Let π be a positive integer. A permutation π of π is a bijection from the set(οΈ{0, . . . , π β 1} to it-)οΈ 0 1 ... πβ1 self. A common way to represent permutations is the two-line notation: π = π(0) π(1) ... π(πβ1) . (Note that we index permutations from 0.) A cycle of π is a minimal subset πΆ of the elements of the permutation such that π(πΆ) = πΆ. A cycle of length 1 is called a fixpoint, one of length 2 is called a transposition. It is a basic result on permutations that every permutation can be uniquely decomposed into disjoint cycles, giving rise to(οΈanother common representation of permutations, the cycle representation. For example, π = 2 3 0 5 4 1 has cycle representation 0 1 2 3 4 5 )οΈ (0 2)(1 3 5)(4), where (0 2) is a transposition, (1 3 5) is a cycle of length 3, and (4) is a fixpoint. A permutation that consists of one cycle only is called cyclic. For more details on permutations, see [27]. Given a word π€, the standard permutation of π€, denoted ππ€ , is the permutation defined by: ππ€ (π) < ππ€ (π) if and only if either (οΈ 0 1 2π€3π 4<5 )οΈπ€π , or π€π = π€π and π < π. For example, the standard permutation of banana is 3 0 4 1 5 2 = (0 3 1)(2 4 5). Burrows-Wheeler-Transform. Given a word π£ of length π, the Burrows-Wheeler-Transform (BWT) [1] of π£ is the concatenation of the final characters of its rotations in lexicographic order. This is often visualized using the so-called BW-matrix, consisting of the lexicographically sorted rotations of π£: bwt(π£) is the last column of this matrix, read from top to bottom. For example, the BWT of banana is nnbaaa. It follows from the definition that two words π’, π£ have the same BWT if and only if they are conjugates. While the BWT is defined for every word, not every word π€ is the BWT of some word. Likhomanov and Shur have shown that a word is the BWT of some word π£ if and only if the number of cycles of its standard permutation equals the greatest common divisor of the runlengths of π€ [21] (see also [20, 23] for earlier, more(οΈ restricted)οΈ versions of this statement). For example, the standard permutation of nnbaaa is 04 15 23 30 41 52 = (0 4 1 5 2 3). It has one cycle and this equals gcd(2, 1, 3); indeed, nnbaaa is the BWT of the word banana. Conversely, bananaβs standard permutation has two cycles, but the gcd of its runlengths is 1, and therefore, it is not the BWT of any word. Nice positions and pseudo-cycles. When is a word the BWT of some word of the form π£$, where $ is the sentinel character? It is clear that such a word must have exactly one occurrence of $. Thus, it follows from the result of Likhomanov and Shur that its standard permutation must be cyclic. The following question was studied in [22]: Given a word π€, where can $ be inserted into π€ such that the resulting word is the BWT of some π£$? Such positions are called nice, formally: Given π€ = π€0 Β· Β· Β· π€πβ1 , a position π is nice, 0 β€ π β€ π, if the (π + 1)-length word π€β² is the BWT of some word of the form π£$, where π€πβ² = π€π for π < π, π€πβ² = $, and π€πβ² = π€πβ1 for π β₯ π. For example, the word nnbaaa has two nice positions, 1 and 3: n$nbaaa = bwt(abanan$) and nnb$aaa = bwt(anaban$), while banana has none: nowhere can a $ be inserted to make it the BWT of some word. Let π be a permutation of π. A non-empty subset π β {0, 1, . . . , πβ1} is called a pseudo-cycle if it can be partitioned into two sets πleft and πright , where πleft < πright (elementwise), such that π(π) = {π₯ β 1 | π₯ β πleft } βͺ πright . The critical interval of pseudo-cycle π is π π = [π + 1, π], where π = max πleft and π = min πright . If πleft is empty, we set π = β1, and if πright is empty, we set π = π. An example of a pseudo-cycle is given in Fig. 1. When $ is inserted in the critical interval, π becomes a cycle, hence the resulting standard permutation is not cyclic. The following characterization of nice positions was given in [22]: Theorem 1 (Thm. 6 in [22]). Let π€ be a word of length π over Ξ£, and 0 β€ π β€ π. Then π is nice if and only if there is no pseudo-cycle π w.r.t. the standard permutation ππ€ whose critical interval contains π. We will denote by π π€ the critical set of π€, the union of the critical intervals of all pseudo- cycles. By Thm. 1 then π is nice if and only if π ΜΈβ π π€ . We will further say that pseudo-cycle π blocks position π if π β π π . We will need one more result from [22]. Recall that here we index permutations from 0, while in [22], they are indexed from 1. Lemma 1 (Thm. 5 in [22]). Let π€ be the BWT of some word, and π the number of cycles of ππ€ . Then π is a nice position. 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 c b c c a $ b a a c b c c a b a a a a a b b c c c a a a c c c $ b b 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 (οΈ 0 1 2 3 4 5 6 7 )οΈ Figure 1: Standard permutation of (οΈπ€ = cbccabaa)οΈ with π(π€) = 5 3 6 7 0 4 1 2 (on the left) and of π€β² = cbcca$baa with π(π€β² ) = 06 14 27 38 41 50 65 72 83 (on the right). In particular, the pseudo-cycle π = {3, 7} of π (with πleft in green, πright in blue) and its critical interval π = [4, 7] (in red) are shown. Inserting $ in position 5 in the word transforms the pseudo-cycle of π(π€) in a cycle in π(π€β² ). 3. New Results on Pseudo-cycles In this section we will give a stricter characterization of nice positions via a subset of all possible pseudo-cycles. We need some new definitions. Definition 1. Let π be a pseudo-cycle, π = πleft βͺ πright . If πleft = β then π is called right-only, if πright = β then π is called left-only. If πleft ΜΈ= β , then we refer to π = max πleft as the boundary of π. A right-only pseudo-cycle is called minimal if no proper subset is a cycle. Example 1. Consider the word cbccabaa. The pseudo-cycle π = {1, 2, 3, 6, 7}, with π(π) = {1, 2, 3, 6, 7} is a right-only pseudo-cycle with critical interval π π = [0, 1], while π β² = {5}, with π(π β² ) = {4} is a left-only pseudo-cycle with with boundary 4 and critical interval π π β² = [6, 8]. Finally, π β²β² = {2, 3, 6, 7}, with π(π β²β² ) = {1, 2, 6, 7} has boundary 3 and critical interval π π β²β² = β²β² = {2, 3} and π β²β² [4, 6]. This last pseudo-cycle can be divided in πleft right = {6, 7}. Lemma 2. Let π and π β² be two pseudo-cycles with boundary π. Then π β© π β² is also a pseudo-cycle with boundary π. Proof. If π β π β² or π β² β π, then the statement is trivial. Else, we have to show that (1) π β π β© π β² , and (2) if π₯ β π β© π β² then π₯ β 1 β π(π β© π β² ) if π₯ β€ π, and π₯ β π(π β© π β² ) if π₯ > π. (1) follows from the fact that both π and π β² have boundary π. Ad (2): Let π₯ β π β© π β² , π₯ β€ π. Since π is a pseudo-cycle with boundary π, it follows that π β1 (π₯ β 1) β π. Similarly, since π β² is a pseudo-cycle with boundary π, π β1 (π₯ β 1) β π β² , thus π β1 (π₯ β 1) β π β© π β² . Now let π₯ β π β© π β² , π₯ > π. Then it follows that π β1 (π₯) β π and π β1 (π₯) β π β² , thus π β1 (π₯) β π β© π β² . Note that not every π is the boundary of some pseudo-cycle. But with Lemma 2, we can now define a unique pseudo-cycle for each π which is. Definition 2. Let 0 β€ π β€ π β 1. A pseudo-cycle π = πleft βͺ πright , with πleft ΜΈ= β is called π-essential if π is the boundary of π and every pseudo-cycle π β² with boundary π is a superset of π. A pseudo-cycle is called essential if it is π-essential for some π. Clearly, for every π, there exists at most one π-essential pseudo-cycle. This implies that altogether there are at most π β 2 essential pseudo-cycles. Example 2. Consider again the word cbccabaa. π = {3, 7}, with π(π) = {2, 7} is the 3- essential pseudo-cycle with critical interval π = [4, 7]. In this pseudo-cycle we have πleft = {3} and πright = {7}. Proposition 1. Given an π-essential pseudo-cycle π, its critical interval π π is maximal w.r.t. the boundary π. In other words, if π β² is a pseudo-cycle with the same boundary π, then π π β² β π π . Proof. Follows immediately from Lemma 2. Corollary 1. Let π€ be a word. Then π π€ equals the union of critical intervals of those pseudo-cycles which are either minimal right-only or essential. 3.1. Algorithm In this section we will describe an alternative algorithm for finding all nice positions of a given word π€. Even though this algorithm is slower than the one in [22], it is of interest due to its greater informativeness w.r.t. pseudo-cycles. The algorithm takes advantage of Corollary 1. This is useful from a computational point of view, as can be deduced from the next lemma. Lemma 3. The set of essential pseudo-cycles can be exponentially smaller than the complete set of pseudo-cycles. Proof. Let π€ = bβπ/2β aβπ/2β , thus |π€| = π. The total number of pseudo-cycles is 2|π€|π = 2βπ/2β , since every subset of the index set of π€ containing at least one a and excluding the first b is a pseudo-cycle with πleft and πright not empty. On the other hand, there are only βπ/2β essential pseudo-cycles, namely sets of the form π = {π, βπ/2β + π} for π β {1, 2, 3, . . . , βπ/2β}. From Cor. 1 we know that it suffices to compute right-only pseudo-cycles and essential pseudo-cycles to identify nice positions of a word. However, our experiments show that on 98% of words with 0 nice positions, all positions are blocked by left-only and right-only pseudo-cycles (data not shown). Based on experimental evidence from [22] we know that a large fraction of words have 0 nice positions: e.g. for binary words of length π = 20, around 65% of words have 0 nice positions, while over a ternary alphabet, 61% of words of length 20 have 0 nice positions. Since right-only and left-only pseudo-cycles can be computed in linear time, our algorithm does this as a first step (Phase 1 and Phase 2). Only if at this point there are still positions which have not been blocked does it proceed to computing essential pseudo-cycles (Phase 3). We next give a description of the three phases of the algorithm. Phase 1 consists in computing the cycle decomposition of ππ€ , since the cycles of the permu- tation are exactly the minimal right-only pseudo-cycles. Phase 2 starts with a filtering step: It removes from the index set all indices which cannot be part of any left-only pseudo-cycle. The idea is that 0 cannot be part of a left-only pseudo-cycle because, since then β1 would be in the image of the pseudo-cycle. This cannot happen because both the pseudo-cycle and its image must be a subset of the set of indices {0, 1, . . . , π β 1}. By using π on 0 we have the information that π(0) + 1 must not be in the pseudo-cycle itself, because otherwise it would imply also the previous statement. If we continue to apply π iteratively we can block every index connected to 0, avoiding them in the following operations. Algorithm 1: Pseudo-cycles finder Input: Word π€ Output: Set of right-only, left-only, and essential pseudo-cycles 1 π β |π€| 2 π β π π‘ππ πππ(π€) 3 π β1 β πππ£π πππ(π) 4 setRight β πππβπ‘π πΆπ¦ππππ (π) // cycle decomposition 5 setLeft β πππ π‘π πΆπ¦ππππ (π, π β1 ) // compute left-only pseudo-cycles 6 if |π π€ | = π + 1 then // if π π€ contains all indices, exit 7 return setRight βͺ setLeft 8 setEss β ππ π π πΆπ¦ππππ (π, π β1 ) // compute π-essential pseudo-cycles 9 return setRight βͺ setLeft βͺ setEss With the remaining set of indices the left-only pseudo-cycles are built. We use a procedure resembling the cycle decomposition of the standard permutation. Starting from position π, we traverse the permutation and we insert π(π) + 1 in π. From this index we continue until we loop back to the starting index π, namely we build a βshiftedβ cycle. This procedure is repeated until no indices are left to create disjoint left-only pseudo-cycles. Phase 3 computes π-essential pseudo-cycles which have a non-empty right part, since essen- tial pseudo-cycles which are left-only have already been computed in Phase 2. Let π΅ be the set of indices for which such a pseudo-cycle has already been computed. For each π > 0, π ΜΈβ π΅, the algorithm computes the unique π-essential pseudo-cycle, if it exists. During this step, the algorithm traverses the permutation iteratively starting from π = π β1 (π β 1) and using this trajectory: β§ βͺabort βͺ if π = 0 βͺ β¨π β1 (π β 1) if π < i βͺ πβ (1) βͺ βͺ βͺ β1 π (π) if π > i β©stop - ps.-c. closed if π = i βͺ The algorithm inserts the indices that it touches either in πleft (if the index is on the left of the boundary) or in πright (if the index is on the right of the boundary). It stops either because it returns to the starting point (boundary) during the traversal of the permutation or because it fails the construction of such pseudo-cycle, ending up in position 0 (which cannot be part of πleft , otherwise we would have -1 in π(π)). An example of the construction of π-essential pseudo-cycles can be seen in Figure 2. Proposition 2. Algorithm 1 outputs exactly the essential pseudo-cycles and the minimal right-only pseudo-cycles of the input word. Therefore, the algorithm computes all pseudo-cycles which are necessary to identify all nice positions. Proposition 3. Algorithm 1 runs in πͺ(π2 ) time. The following example shows that this bound is tight: In fact, there is an infinite number of words on which the algorithm takes Ξ(π2 ) time. The reason is that the running time of Phase 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 b d c a e b d c b d c a e b d c a b b c c d d e a b b c c d d e 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 b d c a e b d c b d c a e b d c a b b c c d d e a b b c c d d e 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Figure 2: An example for the construction of the π-essential pseudo-cycle of the word bdcaebdc with boundary π = 3 (Section 3.1). πleft in green, πright in blue, critical interval in red. 3 is proportional to the sum of the lengths of the essential pseudo-cycles of the input word. E.g. for π = 2 and π = 20, consider π€ = bbaaaaaaaaabbbbbbbba: the sum of the sizes of the essential pseudo-cycles of π€ is 302. We can produce such example words for every even length by adding an a and a b, respectively, to the first run of aβs and to the second run of bβs. 4. On Nice Positions in Fully Clustered Words In this section, we present properties of nice positions in fully clustered words. Recall that a fully clustered word is one which has one run per character, such as cccabb. We will study these questions for different alphabets, where we restrict our attention to words π€ s.t. alph(π€) = Ξ£, i.e. words in which every character appears at least once. We are interested in the number β(π€) of nice positions of a word π€, and in the number of words π»π (π) which have π nice positions, for given length π. As an example, for π = 2, there are 10 fully clustered words of length 6, 6 of which have 1 nice position, 2 have 2 nice positions, and 2 have 3 (see Table 1). Of the 60 fully clustered words for π = 3, 5 have 0 nice positions, 26 have 1, 16 have 2, and 13 have 3 (data not shown). First we gather some useful properties using pseudo-cycles: Lemma 4. Let π€ be any word and ππ€ its standard permutation. The following hold: 1. If π β 1 is a fixpoint, then no π < π is nice. word β(π€) nice positions word β(π€) nice positions abbbbb 1 6 baaaaa 1 1 aabbbb 1 6 bbaaaa 3 2, 4, 6 aaabbb 1 6 bbbaaa 2 3, 5 aaaabb 1 6 bbbbaa 2 2, 4 aaaaab 1 6 bbbbba 3 1, 3, 5 Table 1 Nice positions of fully clustered binary words of length 6. We report the number of nice positions and the nice positions for each word. 2. If π(1) = 0, then no π > 1 is nice. 3. π is nice if and only if ππ€ has no left-only pseudo-cycle. 4. 0 is not nice. Proof. 1. If (π β 1) is a fixpoint, then {π β 1} is a right-only pseudo-cycle, so it blocks all elements from 0 to π β 1. 2. If π(1) = 0, then {1} is a left-only pseudo-cycle, blocking any π > 1. 3. If π is nice, then it cannot be in the critical interval of any pseudo-cycle. But the pseudo-cycles whose critical interval contains π are exactly the left-only pseudo-cycles, since if there is a non-empty right part, then its minimum must be smaller than π. 4. The entire set {0, . . . , π β 1} is a right-only pseudo-cycle, so its minimum 0 blocks all π β€ 0. Lemma 5. Let π€ be any word, |π€| = π. If ππ€ is the identity permutation, then β(π€) = 1. In particular, π is the only nice position. Proof. ππ€ = (0)(1) Β· Β· Β· (π β 1) (in cycle representation), consisting of π fixpoints. Since π β 1 is a fixpoint, by Lemma 4, no π < π can be nice. Clearly, ππ€ has no left-only pseudo-cycles, so again by Lemma 4, π is nice. Since the standard permutation of any word over a unary alphabet is the identity permutation, these words have exactly one nice position: Corollary 2. If π€ is a word over an alphabet of size π = 1, then β(π€) = 1. 4.1. Number of Nice Positions over Alphabet Size 2 Our first result says that every fully clustered word over a binary alphabet has at least one nice position. Proposition 4. If π€ begins with an a, then β(π€) = 1. If π€ begins with a b then β(π€) β₯ 1. Proof. First, let π€ = aπ bπβπ , for some 1 β€ π β€ π β 1. Then ππ€ = ππ, thus, by Lemma 5, there is exactly one nice position, namely π. Now let π€ = bπ aπβπ , for some 1 β€ π β€ π β 1. It is known that a word of this form is the BWT of a standard word, or of a power of a standard word [20], with the first case if gcd(π, π β π) = 1. By the result of Likhomanov and Shur [21], the number of cycles π of ππ€ equals the greatest common divisor of the runlengths, namely π = gcd(π, π β π). By Lemma 1 then, π is nice. Proposition 5. Let π€ = baπβ1 . Then β(π€) = 1. Proof. The standard permutation of π€ is ππ€ = (0, π β 1, π β 2, . . . , 2, 1), in cycle notation. By Lemma 1, 1 is nice, since ππ€ has one cycle. 0 is not nice by Lemma 4. Every π > 0 is a singleton left-only pseudo-cycle, since π(π) = π β 1, while all other pseudo-cycles are subsets of {1, . . . , π β 1}, and thus left-only. Every left-only pseudo-cycle {π} is π-essential, blocking [π + 1, π], i.e. the only position which is nice is 1 (Fig. 3). 0 1 2 3 4 5 6 0 1 2 3 4 5 6 b a a a a a a b a a a a a a a a a a a a b a a a a a a b 0 1 2 3 4 5 6 0 1 2 3 4 5 6 (a) π1 = {0, 1, 2, 3, 4, 5, 6}, (b) π2 = {1}, π(π2 ) = {0}, π 2 = [2, 7], π(π1 ) = {0, 1, 2, 3, 4, 5, 6}, π 1 = [0, 0]. boundary 1. Figure 3: Two pseudo-cycles π1 , π2 in the word baaaaaa and their critical intervals π 1 , π 2 , which block all positions of the word except for 1. (πleft in green, πright in blue, critical intervals in red.) Proposition 6. Let π€ = bπ aπβπ where π = gcd(π, π β π) > 1. Then π€ has at least two nice positions, namely π and π + 2. Proposition 7. Let π€ = bπ+1 aπ . Then β(π€) = 2, in particular, 1 and π are nice. Proof. First, since π€ is a BWT, the number of cycles π of ππ€ is nice, by Lemma 1. Since the runlengths π and π + 1 are relatively prime, π = 1, so 1 is nice. Let 0 < π β€ π. The π-essential pseudo-cycle is ππ = {π, π + π}, since π(π) = π + π and π(π) = π + π β (π + 1) = π β 1. The critical interval of ππ is π π = [π+1, π+π]. The union of these critical intervals is [2, πβ1] β π π€ , since π = |π€| = 2π + 1. Finally, no π-essential pseudo-cycles exist for π > π. Therefore, there are no left-only pseudo-cycles, and thus, by Lemma 4, π is nice (see Figure 4). 4.2. Number of Words with π Nice Positions over Alphabet Size 2 We now turn to the number of words with π nice positions. Definition 3. For π > 0, π β₯ 0, let π»ππ (π) denote the number of fully clustered words with exactly π nice positions over an alphabet of size π. From Prop. 4 it follows that π»π2 (0) = 0, and from Prop. 4 and 5 it follows that π»π2 (1) β₯ π, since the π β 1 words beginning with a and the word baπβ1 have 1 nice position. For a better understanding of the words with π nice positions, we ran experiments on fully clustered words up to length 100, and studied their pseudo-cycles. These led to the following conjectures: 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 b b b b a a a b b b b a a a b b b b a a a a a a b b b b a a a b b b b a a a b b b b 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 (a) π1 = {1, 4}, (b) π2 = {2, 5}, (c) π3 = {3, 6}, π(π1 ) = {0, 4}, π 1 = [2, 4], π(π2 ) = {1, 5}, π 2 = [3, 5], π(π3 ) = {2, 6}, π 3 = [4, 6], boundary 1. boundary 2. boundary 3. Figure 4: Three pseudo-cycles π1 , π2 , π3 of the same form in the word bbbbaaa, and their critical intervals π 1 , π 2 , π 3 . (πleft in green, πright in blue, critical intervals in red.) Conjecture 1. For all π, π»π2 (1) = π. Conjecture 2. For π even, π β₯ 8, π»π2 (2) = 0. For π odd, π»π2 (2) = 1. Conjecture 3. For every π, π»π2 (β π2 β) = 2. The two words with βπ/2β nice positions are bbaπβ2 and bπβ1 a. This is because both have βπ/2β pseudo-cycles whose critical intervals contain exactly one position each. From [22], we know that for words that are BWT images, the parity of nice positions is the same as the parity of the number of cycles. In these two cases, the pseudo-cycles block every position which have the opposite parity, leaving all the remaining positions available. All our conjectures are confirmed by the histograms we produced for words up to length 100 and π nice positions, π up to 19 (data not shown). Furthermore, the words show a regular behaviour with distinct πβs. In particular, we can see that the 8 longest words (4 of even length, 4 of odd length) are disconnected from all the others. Moreover, the greater π, the farther this group of 8 words is from the others. The first π where this group is visible is π = 6, and we identify the 4 words of even length as follows: two primitive words (π€1 = b7 a13 , π€2 = b15 a7 ), two power words (π€3 = b14 a6 , π€4 = b8 a14 ); and the 4 primitive words of odd length (π€5 = b8 a15 , π€6 = b16 a7 , π€7 = b9 a16 , π€8 = b17 a8 ). These words give an idea of the regularity of π»π2 (π) across different π. Consider π€1 = b7 a13 , and the same word with 4 more aβs and 2 more bβs, namely π€1β² = b9 a17 . In both cases there are |π€1 |b β1 number of pseudo-cycles which have the same form, and starting from the one with the smallest boundary, they are shifted by one position to the right. Moreover, the critical interval of each of these pseudo-cycles has length |π€1 |π β 1 (in total they cover 2 Β· |π€1 |π β 1 positions). On the other hand, there are |π€1 |β2Β·|π€2 1 |π β1 additional pseudo-cycles blocking just one position each, e.g. 5 and 7 in π€ resp. π€ . For this reason, β(π€1 ) = |π€1 | β |π€1 |β2Β·|π€ β² 2 1 |π β1 β 2 Β· |π€1 |π β 1 and |π€β² |β2Β·|π€β² | β1 β(π€1β² ) = |π€1β² | β 1 2 1 π β 2 Β· |π€1β² |π β 1, and therefore β(π€1β² ) = β(π€1 ) + 1. We observed this phenomenon holds adding iteratively the same number of aβs and bβs (up to length 100). π 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ππ 10 16 22 28 34 40 46 52 58 64 70 76 82 88 Table 2 Conj. 4: For π = 2, there are no binary words of length greater than ππ with π nice positions. Further, the same happens adding 2 bβs and 4 aβs to π€4 , π€5 , π€7 , and adding 4 bβs and 2 aβs to π€2 , π€3 , π€6 , π€8 . Our experiments suggest that for fixed π, there are no words with greater length then those above, see Table 2. On the basis of these observations, we conjecture that the set β± 2 (π) of binary words with π nice positions is finite. Formally: Conjecture 4. For every π β₯ 3, there exists a length ππ such that no word of length greater than ππ has exactly π nice positions. Finally, we noticed from our experiments that the smallest nice position is always a divisor of π. Let π divide π. As we have seen before, π€ = bπ aπ is the BWT of a word π’π with π’ a standard word and π = gcd(π, π) [20]. Thus π is nice by Lemma 1. The standard permutation of π€ has π cycles, and the largest minimum of a cycle is π β 1, blocking all positions π β€ π β 1. Thus, π given π, π where π is a divisor of π, e.g. the word bπ a( π β1)π has smallest nice position π. We conjecture that the converse is true also: Conjecture 5. An integer π can occur as a smallest nice position for some fully clustered word of length π if and only if π divides π. 5. Conclusion and Ongoing Work In this paper we continued the study of a combinatorial problem introduced in [22]: which are the positions where a dollar can be inserted into a word to make it a BWT (so-called nice positions). We strengthened the characterization via pseudo-cycles of nice positions given there. Pseudo-cycles are particular subsets of the indices; here we showed that a much smaller subset of pseudo-cycles suffices to characterize the set of nice positions of a word. We further presented an algorithm that returns the smallest set of pseudo-cycles needed to identify all nice positions of the input word, and using this algorithm, we studied nice positions of fully clustered words over a binary alphabet. We presented properties on the number of nice positions for distinct fully clustered words, and we gave some conjectures on the number of words with fixed number of nice positions, for which we have experimental evidence. As future work we are planning to extend the study to larger alphabets. We already have extensive experimental data for π = 3, and plan to prove related properties as for the binary alphabet. For example, w.r.t. Conj. 4 we could characterize a consistent phenomenon we observe in the histograms for the ternary alphabets. For π β₯ 3 we reach a plateau after a certain length, which we believe to be strongly related to the conjectured ππ . Moreover the words appearing in this plateau seem to be consisting of only two types, namely aπ cπ bβ and cπ bπ aβ . References [1] M. Burrows, D. J. Wheeler, A block-sorting lossless data compression algorithm, Technical Report, DIGITAL System Research Center, 1994. [2] G. Rosone, M. Sciortino, The Burrows-Wheeler transform between data compression and combinatorics on words, in: 9th Conference on Computability in Europe (CiE 2013), Springer, 2013, pp. 353β364. [3] G. Manzini, An analysis of the Burrows-Wheeler transform, J. ACM 48 (2001) 407β430. [4] G. Navarro, Indexing highly repetitive string collections, part I: repetitiveness measures, ACM Computing Surveys 54 (2021) 29:1β29:31. [5] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics 26 (2010) 589β595. [6] B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology 10 (2009) R25. [7] T. W. Lam, R. Li, A. Tam, S. Wong, E. Wu, S. M. Yiu, High throughput short read alignment via bi-directional BWT, in: 2009 IEEE International Conference on Bioinformatics and Biomedicine, 2009, pp. 31β36. [8] C. Boucher, T. Gagie, A. Kuhnle, B. Langmead, G. Manzini, T. Mun, Prefix-free parsing for building big BWTs, Algorithms Mol. Biol. 14 (2019) 13:1β13:15. [9] P. Ferragina, G. Manzini, Indexing compressed text, J. ACM 52 (2005) 552β581. [10] T. Gagie, G. Navarro, N. Prezza, Optimal-time text indexing in BWT-runs bounded space, in: Proc. of 39th ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), 2018, pp. 1459β1477. [11] M. Crochemore, J. DΓ©sarmΓ©nien, D. Perrin, A note on the Burrows-Wheeler transformation, Theor. Comput. Sci. 332 (2005) 567β572. [12] I. M. Gessel, C. Reutenauer, Counting permutations with given cycle structure and descent set, Journal of Combinatorial Theory 64 (1993) 189β215. [13] S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, L. Versari, Measuring the clustering effect of BWT via RLE, Theor. Comput. Sci. 698 (2017) 79β87. [14] S. Brlek, A. Frosini, I. Mancini, E. Pergola, S. Rinaldi, Burrows-Wheeler transform of words defined by morphisms, in: 30th International Workshop on Combinatorial Algorithms (IWOCA 2019), volume 11638 of Lecture Notes in Computer Science, Springer, 2019, pp. 393β404. [15] S. Giuliani, S. Inenaga, Zs. LiptΓ‘k, N. Prezza, M. Sciortino, A. Toffanello, Novel results on the number of runs of the Burrows-Wheeler-Transform, in: Proc. of 47th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2021), volume 12607 of LNCS, 2021, pp. 249β262. [16] A. Frosini, I. Mancini, S. Rinaldi, G. Romana, M. Sciortino, Logarithmic equal-letter runs for BWT of purely morphic words, in: 26th International Conference on Developments in Language Theory (DLT 2022), volume 13257 of Lecture Notes in Computer Science, Springer, 2022, pp. 139β151. [17] R. Giancarlo, A. Restivo, M. Sciortino, From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization, Theoretical Computer Science 387 (2007) 236β248. [18] I. M. Gessel, A. Restivo, C. Reutenauer, A bijection between words and multisets of necklaces, European Journal of Combinatorics 33 (2012) 1537β1546. [19] J. W. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. LΓ©onard, Γ. Prieur-Gaston, A survey of string orderings and their application to the BurrowsβWheeler transform, Theoretical Computer Science 710 (2018) 52β65. [20] S. Mantaci, A. Restivo, M. Sciortino, BurrowsβWheeler transform and Sturmian words, Information Processing Letters 86 (2003) 241β246. [21] K. M. Likhomanov, A. M. Shur, Two Combinatorial Criteria for BWT Images, in: Proceeding of the 6th International Computer Science Symposium in Russia (CSR 2011), 2011, pp. 385β396. [22] S. Giuliani, Zs. LiptΓ‘k, F. Masillo, R. Rizzi, When a dollar makes a BWT, Theoretical Computer Science 857 (2021) 123β146. [23] J. Simpson, S. J. Puglisi, Words with simple Burrows-Wheeler Transforms, Electronic Journal of Combinatorics 15 (2008). [24] A. Restivo, G. Rosone, Burrows-Wheeler transform and palindromic richness, Theor. Comput. Sci. 410 (2009) 3018β3026. [25] A. Restivo, G. Rosone, Balancing and clustering of words in the BurrowsβWheeler trans- form, Theoretical Computer Science 412 (2011) 3019β3032. [26] S. Ferenczi, L. Q. Zamboni, Clustering words and interval exchanges, Journal of Integer Sequences 16 (2013) 3. [27] M. BΓ³na, Combinatorics of Permutations, Second Edition, Discrete mathematics and its applications, CRC Press, 2012.