Optimal Computation of all Repetitions
            in a Weighted String
                        Carl Barton                Solon P. Pissis
          Department of Informatics, King’s College London, London, UK
                         {Carl.Barton | Solon.Pissis}@kcl.ac.uk


                                         Abstract
A repetition in a string of letters consists of exact concatenations of identical
factors of the string. Crochemore’s repetitions algorithm, usually also referred
to as Crochemore’s partitioning algorithm, was introduced in 1981, and was the
first optimal O(n log n)-time algorithm to compute all repetitions in a string of
length n. A weighted string is a string in which a set of letters may occur at each
position with respective probabilities of occurrence. In this article, we present a
new variant of Crochemore’s partitioning algorithm for weighted strings, which
requires optimal time O(n log n), thus improving on the best known O(n2 )-time
algorithm for computing all repetitions in a weighted string of length n.


1    Introduction
A fundamental structural characteristic of a string of letters is its periodicity.
Closely related to periodicity is the notion of repetition. Repetitions in strings
are highly periodic factors, that is, two or more adjacent identical factors. For
instance, string abab is a repetition in string aababba. In 1981, it was shown by
Crochemore that there could be O(n log n) repetitions in a string of length n and
an O(n log n)-time, thus optimal, algorithm was presented [1].
   Single nucleotide polymorphisms, as well as errors from wet-lab sequencing plat-
forms during the process of DNA sequencing, can occur in some positions of a DNA
sequence. In some cases, these errors can be accurately modelled as a don’t care
letter. However, in other cases the errors can be more subtly expressed, and, at
each position of the sequence, a probability of occurrence can be assigned to each
letter of the nucleotide alphabet; this process gives rise to a weighted string.
   Recently, the authors of [4] proposed an O(n2 )-time algorithm for computing
all repetitions in a weighted string x of length n. The efficiency of the proposed
algorithm relies on the assumption of a given constant, the cumulative weight
threshold, defined as the minimal probability of occurrence of factors in x.

Copyright c by the paper’s authors. Copying permitted only for private and academic purposes.
In: Costas S. Iliopoulos, Alessio Langiu (eds.): Proceedings of the 2nd International Conference
on Algorithms for Big Data, Palermo, Italy, 7-9 April 2014, published at http://ceur-ws.org/


                                               9
Optimal Computation of all Repetitions in a Weighted String

Our Contribution
We present the first optimal algorithm for computing all repetitions in a weighted
string. We improve on the best-known algorithm for computing all repetitions in
a weighted string of length n from time O(n2 ) to an optimal O(n log n).

2    Preliminaries
An alphabet ⌃ is a finite non-empty set of size , whose elements are called letters.
A string on an alphabet ⌃ is a finite, possibly empty, sequence of elements of ⌃.
The zero-letter sequence is called the empty string, and is denoted by ". The length
of a string x is defined as the length of the sequence associated with the string x,
and is denoted by |x|. We denote by x[i], for all 0  i < |x|, the letter at index i of
x. Each index i, for all 0  i < |x|, is a position in x when x 6= ". It follows that
the ith letter of x is the letter at position i in x.
    The concatenation of two strings x and y is the string of the letters of x followed
by the letters of y. It is denoted by xy. A string x is a factor of a string y if there
exist two strings u and v, such that y = uxv. Consider the strings x, y, u, and v,
such that y = uxv. If u = ", then x is a prefix of y. If v = ", then x is a suffix
of y. Let x be a non-empty string and y be a string. We say that there exists an
occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y.
Every occurrence of x can be characterised by a position in y. Thus we say that x
occurs at the starting position.
    A weighted string x on an alphabet ⌃ is a finite sequence of n sets. Every x[i],
for all 0  i < n, is a set of ordered pairs (sj , ⇡i (sj )), where sj 2 ⌃ and ⇡i (sj ) is
the probability of P  having letter sj at position i. Formally, x[i] = {(sj , ⇡i (sj ))|sj 6=
s` for j 6= `, and      j ⇡i (sj ) = 1}. A letter sj occurs at position i of a weighted
string x if and only if the occurrence probability of letter sj at position i, ⇡i (sj ),
is greater than 0. A string u of length m is a factor of a weighted string if
andQonly if it occurs at starting position i with cumulative occurrence probabil-
        m 1
ity j=0 ⇡i+j (u[j]) > 0. Given a cumulative weight threshold 1/z 2 (0, 1], we say
that factor u is valid, or equivalently Qm 1that factor u has a valid occurrence, if it
occurs at starting position i and j=0 ⇡i+j (u[j]) 1/z.
    For every string x and every natural number n, we define the nth power of the
string x, denoted by xn , by x0 = " and xk = xk 1 x, for all 1  k  n. A string
is said to be primitive if it cannot be written as v e , where e 2. A repetition in
x is a non-trivial power of a primitive string occurring in x. Formally, a repetition
ue , e      2, in x is defined as a triple (i, p, e) such that: u = x[i . . i + p 1] =
x[i + p . . i + 2p 1] = . . . = x[i + (e 1)p . . i + ep 1]; ue+1 does not occur at
position i; and u is primitive. A repetition is maximal if i p < 0 or ue does not
occur at x[i p]. The integers p and e are called the period and the exponent of
the repetition, respectively. If e = 2 the repetition is called square.
    A repetition v = ue , e         2, in a weighted string x is defined as a quadruple
(i, p, b, e) such that u = v[0 . . p 1] = v[p . . 2p 1] = . . . = v[(e 1)p . . ep 1],
where v is a factor of length ep of x occurring at position i, and each occurrence of
u in v is a valid factor of x; ue+1 does not occur at position i; u is primitive; and
b is a set of ordered pairs (j, a), where 0  j < p and a 2 ⌃, denoting u[j] = a. A


                                             10
Optimal Computation of all Repetitions in a Weighted String

repetition is maximal if i p < 0 or ue does not occur at x[i       p]. In this article,
we are mainly concerned with the following problem.
Problem 2.1 Given a weighted string x of length n and a cumulative weight
threshold 1/z, find all repetitions in x.

3   Algorithm
We first perform a colouring stage on x, similar to the one before the construction of
the weighted suffix tree [3], which assigns a colour to every position in x according
to the following scheme: mark position i black, if none of the possible letters at
position i has probability of occurrence greater than 1 1/z; mark position i grey,
if one of the possible letters at position i has probability of occurrence greater
than 1 1/z; mark position i white, if one of the possible letters at position i has
probability of occurrence 1.
Lemma 3.1 ([3]) A valid factor of x contains at most dlog z/ log( z z 1 )e black
positions.
We then perform a generation stage, as the one performed during the construction
of the weighted suffix tree, where a set of factors of x is generated. We refer to
this set as extended factors (for a definition, see [3]).
Lemma 3.2 ([3]) A valid factor of x occurs in at least one of its extended factors.
An extended repetition is a repetition occurring in an extended factor of x. A
valid repetition v = ue , e   2, in x is defined as a quadruple (i, p, b, e) such that
u = v[0 . . p 1] = v[p . . 2p 1] = . . . = v[(e 1)p . . ep 1], where v is a valid
factor of length ep of x occurring at position i; ue+1 is not a valid factor of x; u
is primitive; and b is a set of ordered pairs (j, a), where 0  j < p and a 2 ⌃,
denoting u[j] = a. We define the following subproblem.
Problem 3.3 Given a weighted string x of length n and a cumulative weight
threshold 1/z, find all valid repetitions in x.

Lemma 3.4 Every valid repetition in x occurs in at least one extended factor.

For each generated extended factor, we run Crochemore’s partitioning algorithm
for maximal repetitions; the result is all the maximal extended repetitions in x.
After computing all the maximal extended repetitions we cannot simply report all
of these as valid repetitions. All valid factors must occur in an extended factor but
extended factors may contain factors which are not valid. This is a consequence of
treating grey positions as white during the generation of extended factors [3]. Since
not all maximal extended repetitions are valid repetitions, we must therefore break
up these maximal extended repetitions into valid repetitions to solve Problem 3.3.
   In order to break up the maximal extended repetitions, we must compute some
additional information. To determine how long any valid repetition should be, we
must know, for each position i in an extended factor, the length of the longest valid
factor starting at position i. The computation is based on the observation that


                                          11
Optimal Computation of all Repetitions in a Weighted String

the longest factor with probability greater than or equal to 1/z for the position
i + 1 has length greater than or equal to that of position i. To compute this, we
maintain an additional cumulative weight threshold ⇡. We store the computed
lengths in an array LF of integers.
   We start with the first position in an extended factor and naively compute the
longest factor within the threshold by multiplying together the probability of the
letters we encounter and storing this in ⇡. If multiplying the probability of some
position j > 0 causes ⇡ < 1/z we set LF[0] := j 1. To proceed, we remove
by division the occurrence probability of the first letter from ⇡. If ⇡ < 1/z then
LF[1] = j 1; otherwise, we continue as before multiplying the probability of
j + 1, j + 2, and so on, until the threshold is once again violated. For each extended
factor this takes time and space proportional to its length. The sum of lengths of
the extended factors is linear in n by the following statement.
Lemma 3.5 ([3]) The sum of lengths of the extended factors of x is O(n).
The next step is to determine the set b for each maximal extended repetition.
This can be done in constant time per maximal extended repetition. We compute
an array NB of integers of size n, such that for each position i in x, NB[i] stores
the index of the leftmost black position j > i; this can be done in linear time in
n. For each maximal extended repetition ue , we check all black positions in the
first occurrence of u. There can only be a constant number of black positions in
u; finding the black positions using NB takes time proportional to their number.
It is now a simple case of recording the position and the letter present in the
extended factor; this takes constant time per maximal extended repetition, so time
proportional to the number of maximal extended repetitions in total.
    Given all the maximal extended repetitions, we can now begin to break them
up into valid repetitions. To achieve this, we can check the length of the longest
factor starting at position i of the extended factor, and then determine the longest
possible repetition starting from i. We can continue checking the maximal extended
repetition in this manner reporting the length as we go. Note that in the worst
case, for each maximal extended repetition ue , we may check the starting position
of each occurrence of u. As we show later (Lemma 3.7), this can be done efficiently.
We now establish the maximal number of extended repetitions in x. Note that the
work done by the algorithm so far is no more than the maximal number of extended
repetitions.
Lemma 3.6 There could be O(n log n) extended repetitions in x.
As previously mentioned, whilst breaking some maximal extended repetition ue
into valid repetitions, we may need to check up to e positions. The maximum
number of checks required will be the sum of the exponents of all maximal extended
repetitions returned by the partitioning. Now we establish the maximal sum of the
exponents of maximal extended repetitions in a weighted string.
Lemma 3.7 The sum of exponents of maximal extended repetitions in x is
O(n log n).
Note that an analogous version of Lemma 3.6 holds for valid repetitions.


                                          12
Optimal Computation of all Repetitions in a Weighted String

Theorem 3.8 Problem 3.3 can be solved in optimal time O(n log n).

At this point, we have solved the subproblem which forms the basis for our solu-
tion. Intuitively, the subproblem finds repetitions v = ue , where factor v occurs
with probability 1/z. The idea behind our solution to Problem 2.1 is based on
the observation that a repetition of exponent e            3 is composed of overlapping
occurrences of smaller repetitions. We intend to compute smaller repetitions and,
from this, derive larger ones. Part of the process of computing valid repetitions
was to break up maximal extended repetitions below the threshold into smaller
valid repetitions. To determine the repetitions specified in Problem 2.1, we reverse
this process and compose longer repetitions from small valid repetitions.
   In order to solve Problem 2.1, we start by solving Problem 3.3 for threshold
k = 1/z 2 . The number of valid repetitions reported for k can be shown to be
O(n log n) by the same argument as for Lemma 3.6; and the number of black
positions in a valid factor is only a constant amount higher than for the original
threshold by a similar argument to the proof of Lemma 3.1. We pick k = 1/z 2 as
we wish to guarantee that we will at least find squares such that each half may
have probability greater than or equal to 1/z. We may also find repetitions with
a higher exponent and repetitions which have a probability less than 1/z, but we
will explain how to filter these out using the same techniques as for Problem 3.3.
   We alter the solution to Problem 3.3 to simplify the solution to Problem 2.1.
Instead of breaking up maximal extended repetitions into valid repetitions, we
break them into all their valid overlapping squares. There are no more than
O(n log n) valid squares by [2]. This can be shown by an almost identical ar-
gument as Lemma 3.6. To split maximal extended repetitions into their valid
overlapping squares, we process them one by one and create a new square for each
overlapping square in the maximal extended repetition. We only need to perform
this on maximal extended repetitions of exponent e             3, and this will take time
proportional to the sum of the exponents which, by Lemma 3.7, is O(n log n).
   To perform the filtering step, we must check if both halves of the square are
above the threshold 1/z. To check each half, we compute, for each position i in
an extended factor, the length of the longest valid factor starting at position i.
During the generation of extended factors for the threshold k, we at the same
time determine the longest factor with probability greater than or equal to 1/z
by computing an array LF0 which stores the analogous information. Filtering the
squares in time proportional to their number can be done by checking that the
length stored in the array is greater than or equal to the period of the square.
   After the filtering step, we have a set of quadruples (i, p, b, e) representing all
primitive squares such that each half of a square has a probability of occurrence at
least 1/z. Now, for every position i in x, we declare an array Ai of linked lists, such
that the linked list Ai [fi (j)], fi : [1, bn/2c] ! [0, O(log n)], stores all the squares
which occur at position i with period j 2 [1, bn/2c]. We now wish to establish the
size of Ai and the size of the linked lists stored at any Ai [fi (j)], but first we state
a property of valid factors required to show properties of Ai .

Lemma 3.9 A valid factor of x is in O(1) extended factors of x.


                                           13
Optimal Computation of all Repetitions in a Weighted String
                                                             p
Lemma 3.10 Ai is of size O(log n), where            = (1 +       5)/2, and the size of any
linked list Ai [fi (j)] is O(1).

We can now construct the repetitions specified in Problem 2.1. For each position
i, we iterate through the linked lists of array Ai . We iterate through each linked
list Ai [fi (p)], where p is the considered period. We process each square element
(i, p, b, e) 2 Ai [fi (p)] to extend the corresponding square as much as possible, by
checking for an occurrence of the square at position i + p. For a linear string, it is
simple to determine this. For each pair of overlapping squares, the second half of
the first square is the first half of the second square; so it suffices to check whether
there exists a square at position i + p with the same period.

Example Consider y = ababab that contains the following primitive squares:
(0, 2, 2), (1, 2, 2), and (2, 2, 2); we wish to find the repetition (0, 2, 3). We start at
position 0 of y with (0, 2, 2) and check if there is a square of period 2 starting at
position 2. A matching square exists so we extend the repetition and check position
4. There is no square at position 4 so we report the repetition (0, 2, 3).

For weighted strings the approach is very similar, with the addition of a few,
constant-time, checks. We must check, for each pair of overlapping squares, if
the black positions from the first square match with the black positions from the
second square. There is a constant number of black positions so this takes constant
time. Each time we find such overlapping squares, we extend our repetition and
delete the square at position i + p from the corresponding list. As soon as we find
a position where we cannot extend the repetition we stop. We continue doing this
until we have found all repetitions.
   Each time we iterate through a linked list, a square may be added to the repeti-
tion we are extending; this takes constant time per list by Lemma 3.10. After each
square is added to the repetition, it is deleted so is not considered again. There
are O(n log n) squares in the array and from the above description we can see that
each square is considered a constant number of times. It is clear that we construct
no more repetitions than there are primitive squares, so the number of constructed
repetitions is also O(n log n). These repetitions will be maximal, and to report
repetitions specified in Problem 2.1, we may check the start of each occurrence in
the repetition and report them. This takes no more than the sum of exponents
which is O(n log n). We can now state the main result of this article.

Theorem 3.11 Problem 2.1 can be solved in optimal time O(n log n).

References
[1] M. Crochemore. An optimal algorithm for computing the repetitions in a word.
    Inf. Process. Lett., 12(5):244–250, 1981.

[2] M. Crochemore, L. Ilie, and W. Rytter. Repetitions in strings: Algorithms
    and combinatorics. Theoretical Computer Science, 410(50):5227 – 5235, 2009.
    Mathematical Foundations of Computer Science (MFCS 2007).


                                           14
Optimal Computation of all Repetitions in a Weighted String

[3] C. S. Iliopoulos, C. Makris, Y. Panagis, K. Perdikuri, E. Theodoridis, and
    A. Tsakalidis. The weighted suffix tree: An efficient data structure for handling
    molecular weighted sequences and its applications. Fundam. Inf., 71(2,3):259–
    277, Feb. 2006.
[4] H. Zhang, Q. Guo, and C. S. Iliopoulos. Locating tandem repeats in weighted
    sequences in proteins. BMC Bioinformatics, 14(S-8):S2, 2013.


                                          15