Preliminaries

Searching All Approximate Covers and Their Distance using Finite Automata

Ondˇrej Guth

Boˇrivoj Melichar

Miroslav Bal´ık

Cˇ esk´e vysok´e uˇcen´ı technick´e v Praze

Praha

gutho

melichar

balikm}@fel.cvut.cz

201

Cover is a type of a regularity of strings. A restricted approximate cover w of string T is a factor of T such that every position of T lies within some approximate occurrence of w in T . In this paper, the problem of all restricted smallest distance approximate covers of a string is studied and a polynomial time and space algorithm for solving the problem is presented. It searches for all restricted approximate covers of a string with given limited approximation using Hamming distance and it computes the smallest distance for each found cover. The solution is based on a finite automata approach, that provides a straightforward way to design algorithms to many problems in stringology. Therefore it is shown that the set of problems solvable using finite automata includes the one studied in this paper.

Preliminaries

An alphabet is a nonempty finite set of symbols, denoted by A. A string over an alphabet is a finite sequence of symbols of the alphabet. Empty string is an empty sequence of symbols, denoted by ε. An effective alphabet of a string T is a set of symbols that really occur in T . Only effective alphabet is considered in this paper. A language is a set of strings. A set of all strings over alphabet A is denoted by A∗. The length of a string w is denoted by |w|, the i–th symbol of w is denoted by w[i]. An operation concatenation is defined in this way: x, y ∈ A∗, concatenation of x and y is xy, may be denoted by x.y. An operation superposition is defined in this way: x = pu, y = us, superposition of x and y is pus. Suppose u, w, x, T ∈ A∗. w is a prefix 1 Introduction of T if T = wu, w is a suffix of T if T = uw, and w is a factor (also called a substring) of T if T = uwx.

A set of all prefixes of T is denoted by Pref (T ), a set Searching regularities of strings is used in a wide area of all suffixes of T is denoted by Suff (T ), and a set of of applications like molecular biology and computer– all factors of T is denoted by Fact (T ). assisted music analysis. One of typical regularities is A deterministic finite automaton (also called a decover. terministic finite state machine, denoted by DFA) is a

Finding exact covers is not sufficient in some appli- quintuple (Q, A, δ, q0, F ), where Q is a nonempty finite cations, thus approximate covers have to be computed. set of states, A is an input alphabet, δ is a transition In this paper, the Hamming distance is considered. function, δ : Q × A 7→ Q, q0 ∈ Q is an initial state and F ⊆ Q is a set of final states.

Exact covers were introduced in [ 1 ], an algorithm A nondeterministic finite automaton without ε– for computation of all exact covers in linear time was transitions is a quintuple (Q, A, δ, q0, F ), where Q is a presented in [ 4 ]. An algorithm using finite automata nonempty finite set of states, A is an input alphabet, approach to computation of all exact covers was intro- δ is a transition function, where δ : Q × A 7→ P (Q), duced in [ 5 ]. q0 ∈ Q is an initial state and F ⊆ Q is a set of final

The algorithm presented in [ 2 ] searches for one re- states. It is denoted by NFA. stricted smallest approximate cover (i.e. cover with the A state q is a successor of state p of a deterministic smallest distance), using dynamic programming. An finite automaton (Q, A, δ, q0, F ) if q = δ(p, a) for some algorithm using finite automata approach to compu- a ∈ A. A state qN is a successor of a state pN of a tation all restricted approximate covers for Hamming, Levenshtein, and Damerau distance was introduced in [ 3 ].

NFA (QN , A, δN , q0N , FN ) if q ∈ δN (pN , a).

String w = a1a2 . . . a|w| is said to be accepted by a

DFA (Q, A, δ, q0, F ) if there exists a sequence

This paper is organized as follows. In Section 2, δ(q0, a1) = q1, δ(q1, a2) = q2, . . . , δ(q|w|−1, a|w|) ∈ F . some notations and definitions used in this paper are String w = a1a2 . . . a|w| is said to be accepted by a described. In Section 3, the algorithm for the problem NFA (Q, A, δ, q0, F ) if there exists a sequence is presented. In Section 4, the complexities of the al- δ(q0, a1) = Q1, δ(q1, a2) = Q2, . . . , δ(q|w|−1, a|w|) ⊆ F gorithm are proven. In Section 5, experimental results for some q1 ∈ Q1, . . . , q|w|−1 ∈ Q|w|−1. A language are shown. accepted by a finite automaton M is denoted by L(M ).

A left language of a state q of a nondeterminis- there exists string s ∈ Suff (T ) such that DH (w, s) ≤ tic finite automaton (Q, A, δ, q0, F ) is a set of strings k. w = a1a2 . . . a|w|, where for each w exists a sequence A nondeterministic Hamming suffix automaton M δ(q0, a1) = Q1, δ(q1, a2) = Q2, . . . , δ(q|w|−1, a|w|) = for a string T and distance k is such nondeterministic Q|w|, q ∈ Q|w| for some q1 ∈ Q1, . . . , q|w|−1 ∈ Q|w|−1. finite automaton without ε–transitions, that L(M ) = A left language of a state q of a DFA (Q, A, δ, q0, F ) {w : DH(w, s) ≤ k, s ∈ Suff (T )}. Such an automaton is a set of strings w = a1a2 . . . a|w|, where for each w M = (Q, A, δ, q0, F ) may be constructed in this way: exists a sequence δ(q0, a1) = q1, δ(q1, a2) = q2, . . . , δ(q|w|−1, a|w|) = q. 1. Create a layer of |T | + 1 states:

A maxfactor of a state q of a DFA (Q, A, δ, q0, F ) (a) each state qi0 corresponds to a position i in T is the longest string of left language of q, denoted by (plus initial state q0, thus 0 < i ≤ |T |), maxfactor (q). A depth of a state q of a DFA is the (b) for each state qi0 (but the last q|0T |) define tranlength of maxfactor (q), denoted by depth(q).

A DFA MD = (Q, A, δ, q0, F ) is equivalent to a NFA MN = (QN , A, δN , q0N , FN ) if L(MN ) = L(MD).

Subset construction may be used: sition δ(qi0, T [i]) = qi0+1, (c) define the last state q|0T | final (note that until

now such automaton accepts exactly T ). 2. Similarly, create a layer for each “number of errors” l, 1 ≤ l ≤ k (only exception: we do not need any state qil for l > i). 3. For each state qil (but the last q|T | in each layer and but the last layer) and for each symbol a ∈ A, a 6= T [i] (not occurring in T at position i), define transition δ(qil, T [i]) = qil++11. 4. Create “long” transitions from q0: δ(q0, a) = {qi0 : a = T [i], a ≤ i ≤ |T |}∪{qi1 : a 6= T [i], 1 ≤ i ≤ |T |}. 1. Set Q = {{q0}} will be defined, state q0 = {q0N }

will be treated as unmarked. 2. If each state in Q is marked then continue with

step 4. 3. Unmarked state q will be chosen from Q and the following operations will be executed: (a) δ(q, a) = S δN (pN , a) for pN ∈ q and for all

a ∈ A, (b) Q = Q ∪ δ(q, a) for all a ∈ A, (c) state q ∈ Q will be marked, (d) continue with step 2. 4. F = {q : q ∈ Q, pN ∩ FN 6= ∅, pN ∈ q}.

For example of a transition diagram of a nondetermin

istic Hamming suffix automaton see Fig. 1.

A level of a state of a nondeterministic Hamming suffix automaton corresponds to the number of errors, a depth of a state of this automaton is equal to the Using subset construction of MD equivalent to MN , corresponding position in T . every state qD ∈ Q corresponds to some subset of QN .

This subset is called a d–subset, denoted by d(qD). Definition 1 (Restricted approximate cover). Let Each element of the d–subset corresponds to some T and w be strings. We say, that w is a restricted apstate of QN . Where no confusion arises, depth of a proximate cover of T with Hamming distance k if w is state corresponding to an element rj ∈ d(qD) of d– a factor of T and there exist strings s1, s2, . . . , sr (all subset d(qD) is simply denoted by rj, as numeric rep- some substrings of T ) such that: resentation of rj corresponds to the depth. In the algorithms below, d–subset is supposed to be implemented 1. DH (w, si) ≤ k for all i where 1 ≤ i ≤ r, as a list, preserving order of its elements. An element 2. T can be constructed by superpositions and conof the d–subset is denoted by ri, where the subscript catenations of copies of the strings s1, s2, . . . , sr. i means an index (order) of the element ri within the d–subset. Note 1. An approximate cover is more general regu

A distance is the minimum number of editing op- larity than restricted approximate cover, because (unerations that are necessary to convert a string x into restricted) approximate cover of T needs not be a faca string y. The maximum allowed distance is denoted tor of T . In this paper, only restricted approximate by k. cover is considered.

The Hamming distance between strings x and y is equal to the minimum number of editing operations Definition 2 (Restricted smallest distance apreplace that are necessary to convert x into y. The proximate cover). Let T and w be strings. We say, Hamming distance function is denoted by DH . that w is a restricted smallest distance approximate

String w ∈ A∗ is an approximate prefix of a string cover of T with distance k if w is a restricted approxT ∈ A∗ with the maximum Hamming distance k if imate cover of T with the distance k and there exists there exists string p ∈ Pref (T ) such that DH (w, p) ≤ no l < k such that w is a restricted approximate cover k. String w is an approximate suffix of the string T if of T with the distance l.

Problem 1 (All restricted smallest distance approximate computed. When q represents an approximate prefix, covers of a string). Given string T over alphabet A, its successors are recursively generated and processed. Hamming distance function DH and distance k, find Note that the set of final states of the deterministic all restricted approximate covers of T and their small- approximate cover candidate automaton is not needed est distances. A set of all restricted smallest distance (it would contain all states having d–subsets containapproximate covers of string T under Hamming dis- ing element corresponding to some final state of the tance k is denoted by coversH k (T ). nondeterministic Hamming suffix automaton). Definition 3 (Approximate cover candidate automaton). An approximate cover candidate automaton (Q, A, δ, q0, F ) for string T ∈ A∗, Hamming distance function DH and the maximum distance k accepts set W = {w1, w2, . . . , wl} of factors of T , where for each wi ∈ W holds: Distance l of each cover w = maxfactor (q) may

As any approximate cover of a string T under Ham- vary between 0 and k. Moreover, it cannot be less than ming distance is an approximate prefix and an approx- level of the first or the last element of d(q), because imate suffix of T (proven in [ 3 ]), an automaton accept- each cover must be an approximate prefix and suffix. ing only such strings can be used. Of course, it cannot be more than the maximum level of elements of d(q). The Algorithm 1 removes all the elements having the maximum level but the first and the last element of d(q), and tries whether w covers T without those removed positions.

1. there exists p ∈ Pref (T ) such that DH (p, wi) ≤ k,

and 2. there exists s ∈ Suff (T ) such that DH (s, wi) ≤ k.

In [ 3 ], a construction of an automaton accepting intersection of approximate prefixes and approximate suffixes is used for construction of a deterministic approximate cover candidate automaton. Although this is a straightforward idea, specialized method (more effective) is presented for Hamming distance in the following section. 3

Problem solution

The principle of the solution is following: first, we perform a subset construction of a deterministic cover candidate automaton from a nondeterministic Hamming suffix automaton for string T and k, as every d(q) represents a set of positions of w = maxfactor (q) within T . If we treat with d(q) as with a sorted list (ordered by depths of its elements), each pair of subsequent elements represents positions of subsequent occurrences of w within T . When for such positions i, j, i < j holds j − i > |w|, we know that w cannot be a cover of T . The distance of w is the minimum l such that it is possible to remove all elements r ∈ d(q) having level (r) > l and the previous condition holds.

In fact, it is not necessary to save complete deterministic automaton. Unlike in [ 3 ], we do not make construction of the deterministic cover candidate automaton and subsequent computation of covering. A depth–first search algorithm is used to perform subset construction and computation of covering and of the distance of each cover: in Algorithm 2, for each state and symbol, a successor q is generated, it is determined whether it represents a cover and the distance is

Algorithm 1 Smallest distance of a cover of T .

Input: d–subset d(q) representing a cover w of T . Output: The smallest distance l of w. 1: lmin ← max{level (r1), level (r|d(q)|)} 2: lmax ← maxr∈d(q){level (r)} 3: l ← lmax 4: repeat 5: for all r ∈ d(q) \ {r1, r|d(q)|} : level (r) = l do 6: remove r from d(q) 7: end for 8: l ← l − 1 9: until l ≥ lmin and for all i = 2, 3, . . . , |d(q)| : ri−ri−1 ≤ depth (q) 10: l ← l + 1.

Example 1. Let us have a string T = aabccccb over alphabet A = {a, b, c} and let us compute a set of all restricted smallest distance approximate covers of T under Hamming distance k = 2 using Algorithm 3.

Because of the distance 2, we are interested in covers of length at least 3 or having distance less than 2. We construct a nondeterministic Hamming suffix automaton MS (see Fig. 1), then an approximate cover candidate automaton M is analysed (see Fig. 2).

Looking at the d–subset {3, 4′′, 8′′}, it represents an approximate prefix and suffix aab of length 3, but for its positions holds 8 − 4 3, thus the factor aab is not an approximate cover of T with Hamming distance 2. Looking at the other d–subset {3′′, 5′′, 6′, 7′, 8}, it represents factor ccb, that covers T with Hamming distance 2. It is checked whether it covers T with distance 1 (Alg. 1). As the first element of the d–subset has level equal to 2, lmin is equal to 2. The resulting set of the covers is coversaabccccbH (2 ) = {(ccb, 2), (aabccccb, 0)}. b c 1′2′34′5′6′7′8 1′2′3′45678′ c b c 2′′3′45′6′7′8′′ 2′′3′4′′5′6′7′8 2′′3′′4′5678′ b 3′′5′′6′7′8 a 123′4′5′6′7′8′ ab 23′4′′5′′6′′7′′8′′ b c c 56′′ c 67′′ c Proof. The automaton consists of layers of states q(i) for each level i. The layer of states q0 contains |T | + 1 states. Each layer of states q(i) contains one state less in comparison with layer of states q(i−1), thus it contains |T | − i + 1 and layer of states q(k) contains |T | − k + 1 states.

The automaton contains |A| transitions from each state, with some exceptions. There are k+1 final states having no successor. In the layer of states q(k), each state has only one successor. From the initial state, there are |T | transitions defined to the states q(0) having level (q(0)) = 0 and |T | · (|A| − 1) transitions to the states q(1) having level (q(1)) = 1. Thus in MS there are |Q|·|A|+ |T |·|A|− (k + 1)·|A|−(|T |−k + 1)·(|A|−1) = |A| · (|Q| − 2) + |T | − k + 1 transitions.

Note 2. As restricted approximate covers of string T are exact factors of T , it is meaningful to consider effective alphabet A only, thus |A| ≤ |T | always holds.

It is also meaningless to consider large k, because every (QS, A, δS, q0S, FS), there is no need for any additional factor of T having length less or equal to k is always data structures. For the purpose of the construction of approximate cover of T . Thus k ≤ |T | always holds. the deterministic cover candidate automaton M , only

Usually, k ≪ |T | and |A| ≪ |T | (e.g. in DNA anal- the set of states and transitions from q0S need to be ysis, A = {a, c, g, t}). Therefore k and |A| may be con- preserved, because the rest may be computed later in sidered as small constants independent of |T |. O(1) time and space using knowledge of a depth and a level of a state, k, and T . Thus the space complexity Lemma 2. The deterministic approximate cover can- of this construction is O((k + |A|) · |T |). didate automaton M for string T and the Hamming During the computation of the smallest distance distance contains at most |T |22+|T | + 1 states. (Algorithm 1), only O(1) additional data is needed. During the processing of states of M (Algorithm 2), Proof. Each d–subset d(q) of M contains at least one r the needed space is limited by the number of elements such that level (r) = 0, thus maxfactor (q) ∈ Fact (T ). of all d–subsets (Lemma 4) preserved in a memory and The number of possible factors of length depth(q) is by the number of all approximate covers (the result, at most |T | − depth(q) + 1, thus the maximum num- limited by the number of all factors of T – at most ber of states of M having equal depth is also |T | − O(|T |2)). depth(q) + 1. The automaton M also contains an initial state. Therefore, the number of states of M is at most (|T |−1+1)+(|T |−|T |+1) · |T | + 1.

Lemma 5. Using Algorithms 2, and 3 for construc

tion of a deterministic cover candidate automaton M = (Q, A, δ, q0, F ) from a nondeterministic Hamming sufLemma 3. During the construction of the determin- fix automaton MS = (QS, A, δS, q0S, FS), all d–subsets istic cover candidate automaton M for string T , Al- are sorted in ascending order by depths within MS. gorithms 2, 3 need to hold at most |T | + 2 states at a time.

Proof. Having p, q ∈ Q\{q0} such that q is a successor of p, suppose that d(p) is sorted in order by depths Proof. Algorithm 2 works as a depth–first search al- within MS. It holds that for any pS, qS ∈ QS such gorithm. For each state and symbol it generates at that qS is a successor of pS, depth(qS) > depth(pS). most one state – possible successor. Thus it holds at Therefore d(q) constructed from already sorted d(p) is most |T | + 1 states of M (|T | states having d–subsets also sorted. representing exact prefixes of T plus initial state) and For p = q0, it is supposed that δS(q0S , a) is cona state generated for a final state, having empty d– structed as sorted in order by depths within MS. subset.

Lemma 6. Time complexity of Algorithm 1 is O(k · |T |) for each state. Lemma 4. During the construction of the determin

istic cover candidate automaton M for string T , Algorithms 2, 3 need to hold at most |T |2+|T | + 1 elements 2 of d–subsets at a time.

Proof. Algorithm 1 may remove some elements of a

d–subset in each iteration, thus the iteration may take O(|T |) time. The number of iterations may be at most k.

Proof. Alg. 2 needs at most |T | + 2 states in a memory Lemma 7. Time complexity of Algorithm 2 (from the at a time (Lemma 3). The deterministic cover candi- initial state) is O((k + |A|) · |T |3). date automaton M = (Q, A, δ, q0, F ) is constructed Proof. Algorithm 2 constructs for all states q and all by subset construction from a nondeterministic Ham- a ∈ A the d–subsets of all possible successors of q. ming suffix automaton MS = (QS, A, δS, q0S, FS ). In The number of states is O(|T |2) (Lemma 2) and the MS, each state but q0S has at most one successor number of elements of each d–subset is O(|T |). For for each symbol, q0S has |T | successors for each sym- each state, the computation of covering is performed bol. For each state pS and its successor qS in MS (it takes O(|T |)), and for each cover (their number is holds: depth(qS ) > depth(pS). The longest possible O(T 2)), the computation of the smallest distance is d–subset d(p) contains r|T | having depth(r|T |) = |T |, performed (it takes O(k · |T |) for each cover – Lemma and r1 having depth(r1) = 1. As |δS(r1, a)| ≤ 1 and 6). δS(r|T |, a) = ∅ for every a ∈ A, for state p and its successor q in M holds: |d(q)| ≤ |d(p)| for p 6= q0 and Theorem 2. Time complexity of Alg. 3 is O((k+|A|)· |d(q)| ≤ |T | for p = q0. |T |3).

Theorem 1. Space complexity of Alg. 3 is O(|T |2). Proof. It clearly holds that construction of the nondeterministic Hamming suffix automaton takes O((k + Proof. It clearly holds that for construction of the |A|) · |T |). Construction of the deterministic cover cannondeterministic Hamming suffix automaton MS = didate automaton takes O((k + |A|) · |T |3) (Lemma 7).

Experimental results The algorithm was implemented in C++ using STL,

the program was compiled using the GNU C++ compiler with O3 optimizations level. The dataset used to test the algorithm is the nucleotide sequence of Saccharomyces cerevisiae chromosome IV1. The string T consists of the first |T | characters of the chromosome.

The first set of tests was run on a AMD Athlon 64 3200+ (2200 MHz) system, with 2.5 GB of RAM, under Fedora Linux operating system (see Figs. 3, 4).

Athlon64, for k=11 and k=31 12 10 8 ] c e [se 6 m i T 4 2 00 1400 1200 1000 ]cse 800 [ ieTm 600 400 300 400

Text length 100 200 500 600 700 Fig. 3. Time consumption with respect to the text size (solid line for k = 11, dotted one for k = 31)

Athlon64, for |T|=1162 and |T|=1550

400

Text length 200 300 500 600 700 Fig. 5. Time consumption with respect to the text size (solid line for k = 101, dotted one for k = 201)

Athlon, for |T|=114 and |T|=153 2.5 2.0

In this paper, we have shown that an algorithm de

200 sign based on a determinisation of a suffix automa00 200 400 60M0aximu8m00distanc1e000 1200 1400 1600 taopnpriosxai mppartoepcroiavteersfoorf aallstrreisntgripctreodblsemmaflolerstHdaimstma nincge distance. The presented algorithm is straightforward, Fig. 4. Time consumption with respect to the distance easy to understand and to implement and its theoreti(solid line for |T | = 1162, dotted one for |T | = 1550) cal and experimental time requirements are comparable to the existing approach ([ 2 ]).

The algorithm may be extended to work with other distance functions, possibly using the idea presented

The second set of tests was run on a AMD Athlon in [ 3 ]. Theoretical and experimental analysis similar (1400 MHz) system, with 1.2 GB of RAM, under Gen- to one presented here may be accomplished. The algotoo Linux operating system (see Figs. 5, 6). rithm may be also extended to use parallelism.

Note 3. In comparison with experimental results pre

sented in [ 2 ], the algorithm presented in this paper runs a bit faster for the same data, even on a slightly slower computer (1.3 seconds in [ 2 ] for text length 100

This research was partially supported by the Ministry

1 The Saccharomyces cerevisiae chromosome IV dataset of Education, Youth, and Sport of the Czech Repubcould be downloaded from http://www.genome.jp/. lic under research program MSM 6840770014, by the

Acknowledgements

1. Apostolico , A. , Farach , M. , and Iliopoulos , C. S. Optimal superprimitivity testing for strings . Inf. Process. Lett. 39 , 1 ( 1991 ), 17 - 20 .

2. Christodoulakis , M. , Iliopoulos , C. S. , Park , K. , and Sim , J. S. Implementing approximate regularities . Mathematical and Computer Modelling 42 ( October 2005 ), 855 - 866 .

3. Guth , O. Searching approximate covers of strings using finite automata . In Proceedings of POSTER ( 2008 ), Faculty of Electrical Engineering, Czech Technical University in Prague.

4. Smyth , W. F. Approximate periodicity in strings . Utilitas Mathematica 51 ( 1997 ), 125 - 135 .

5. Vora´cˇek, M. , and Melichar , B. Searchig for regularities in generalized strings using finite automata . In Proceedings of the International Conference on Numerical Analysis and Applied Mathematics ( 2005 ), WILEY - VCH Verlag, pp. 809 - 812 .