=Paper= {{Paper |id=Vol-414/paper-4 |storemode=property |title=Searching all approximate covers and their distance using finite automata |pdfUrl=https://ceur-ws.org/Vol-414/paper4.pdf |volume=Vol-414 |dblpUrl=https://dblp.org/rec/conf/itat/GuthMB08 }} ==Searching all approximate covers and their distance using finite automata== https://ceur-ws.org/Vol-414/paper4.pdf
Searching All Approximate Covers and Their Distance using Finite
                          Automata

                                Ondřej Guth, Bořivoj Melichar, and Miroslav Balı́k

                                   České vysoké učenı́ technické v Praze, Praha, CZ,
                                       {gutho1,melichar,balikm}@fel.cvut.cz

Abstract. Cover is a type of a regularity of strings. A re-     2    Preliminaries
stricted approximate cover w of string T is a factor of T
such that every position of T lies within some approximate
occurrence of w in T . In this paper, the problem of all re-
                                                                An alphabet is a nonempty finite set of symbols, de-
stricted smallest distance approximate covers of a string is    noted by A. A string over an alphabet is a finite se-
studied and a polynomial time and space algorithm for solv-     quence of symbols of the alphabet. Empty string is an
ing the problem is presented. It searches for all restricted    empty sequence of symbols, denoted by ε. An effective
approximate covers of a string with given limited approxi-      alphabet of a string T is a set of symbols that really
mation using Hamming distance and it computes the small-        occur in T . Only effective alphabet is considered in
est distance for each found cover. The solution is based on     this paper. A language is a set of strings. A set of all
a finite automata approach, that provides a straightforward     strings over alphabet A is denoted by A∗ . The length
way to design algorithms to many problems in stringology.       of a string w is denoted by |w|, the i–th symbol of w is
Therefore it is shown that the set of problems solvable using   denoted by w[i]. An operation concatenation is defined
finite automata includes the one studied in this paper.
                                                                in this way: x, y ∈ A∗ , concatenation of x and y is xy,
                                                                may be denoted by x.y. An operation superposition is
                                                                defined in this way: x = pu, y = us, superposition of
                                                                x and y is pus. Suppose u, w, x, T ∈ A∗ . w is a prefix
1    Introduction                                               of T if T = wu, w is a suffix of T if T = uw, and w
                                                                is a factor (also called a substring) of T if T = uwx.
                                                                A set of all prefixes of T is denoted by Pref (T ), a set
Searching regularities of strings is used in a wide area        of all suffixes of T is denoted by Suff (T ), and a set of
of applications like molecular biology and computer–            all factors of T is denoted by Fact (T ).
assisted music analysis. One of typical regularities is             A deterministic finite automaton (also called a de-
cover.                                                          terministic finite state machine, denoted by DFA) is a
    Finding exact covers is not sufficient in some appli-       quintuple (Q, A, δ, q0 , F ), where Q is a nonempty finite
cations, thus approximate covers have to be computed.           set of states, A is an input alphabet, δ is a transition
In this paper, the Hamming distance is considered.              function, δ : Q × A 7→ Q, q0 ∈ Q is an initial state and
                                                                F ⊆ Q is a set of final states.
    Exact covers were introduced in [1], an algorithm
                                                                    A nondeterministic finite automaton without ε–
for computation of all exact covers in linear time was
                                                                transitions is a quintuple (Q, A, δ, q0 , F ), where Q is a
presented in [4]. An algorithm using finite automata
                                                                nonempty finite set of states, A is an input alphabet,
approach to computation of all exact covers was intro-
                                                                δ is a transition function, where δ : Q × A 7→ P(Q),
duced in [5].
                                                                q0 ∈ Q is an initial state and F ⊆ Q is a set of final
     The algorithm presented in [2] searches for one re-        states. It is denoted by NFA.
stricted smallest approximate cover (i.e. cover with the            A state q is a successor of state p of a deterministic
smallest distance), using dynamic programming. An               finite automaton (Q, A, δ, q0 , F ) if q = δ(p, a) for some
algorithm using finite automata approach to compu-              a ∈ A. A state qN is a successor of a state pN of a
tation all restricted approximate covers for Hamming,           NFA (QN , A, δN , q0N , FN ) if q ∈ δN (pN , a).
Levenshtein, and Damerau distance was introduced in                 String w = a1 a2 . . . a|w| is said to be accepted by a
[3].                                                            DFA (Q, A, δ, q0 , F ) if there exists a sequence
    This paper is organized as follows. In Section 2,           δ(q0 , a1 ) = q1 , δ(q1 , a2 ) = q2 , . . . , δ(q|w|−1 , a|w| ) ∈ F .
some notations and definitions used in this paper are           String w = a1 a2 . . . a|w| is said to be accepted by a
described. In Section 3, the algorithm for the problem          NFA (Q, A, δ, q0 , F ) if there exists a sequence
is presented. In Section 4, the complexities of the al-         δ(q0 , a1 ) = Q1 , δ(q1 , a2 ) = Q2 , . . . , δ(q|w|−1 , a|w| ) ⊆ F
gorithm are proven. In Section 5, experimental results          for some q1 ∈ Q1 , . . . , q|w|−1 ∈ Q|w|−1 . A language
are shown.                                                      accepted by a finite automaton M is denoted by L(M ).
    A left language of a state q of a nondeterminis-                   there exists string s ∈ Suff (T ) such that DH (w, s) ≤
tic finite automaton (Q, A, δ, q0 , F ) is a set of strings            k.
w = a1 a2 . . . a|w| , where for each w exists a sequence                  A nondeterministic Hamming suffix automaton M
δ(q0 , a1 ) = Q1 , δ(q1 , a2 ) = Q2 , . . . , δ(q|w|−1 , a|w| ) =      for a string T and distance k is such nondeterministic
Q|w| , q ∈ Q|w| for some q1 ∈ Q1 , . . . , q|w|−1 ∈ Q|w|−1 .           finite automaton without ε–transitions, that L(M ) =
A left language of a state q of a DFA (Q, A, δ, q0 , F )               {w : DH (w, s) ≤ k, s ∈ Suff (T )}. Such an automaton
is a set of strings w = a1 a2 . . . a|w| , where for each w            M = (Q, A, δ, q0 , F ) may be constructed in this way:
exists a sequence
δ(q0 , a1 ) = q1 , δ(q1 , a2 ) = q2 , . . . , δ(q|w|−1 , a|w| ) = q.   1. Create a layer of |T | + 1 states:
    A maxfactor of a state q of a DFA (Q, A, δ, q0 , F )                  (a) each state qi0 corresponds to a position i in T
is the longest string of left language of q, denoted by                       (plus initial state q0 , thus 0 < i ≤ |T |),
maxfactor (q). A depth of a state q of a DFA is the                       (b) for each state qi0 (but the last q|T  0
                                                                                                                      | ) define tran-
length of maxfactor (q), denoted by depth(q).                                             0             0
                                                                              sition δ(qi , T [i]) = qi+1 ,
    A DFA MD = (Q, A, δ, q0 , F ) is equivalent to a                      (c) define the last state q|T   0
                                                                                                            | final (note that until
NFA MN = (QN , A, δN , q0N , FN ) if L(MN ) = L(MD ).                         now such automaton accepts exactly T ).
Subset construction may be used:
                                                                       2. Similarly, create a layer for each “number of er-
 1. Set Q = {{q0 }} will be defined, state q0 = {q0N }                    rors” l, 1 ≤ l ≤ k (only exception: we do not need
    will be treated as unmarked.                                          any state qil for l > i).
 2. If each state in Q is marked then continue with                    3. For each state qil (but the last q|T | in each layer
    step 4.                                                               and but the last layer) and for each symbol a ∈
                                                                          A, a 6= T [i] (not occurring in T at position i), de-
 3. Unmarked state q will be chosen from Q and the                                                           l+1
                                                                          fine transition δ(qil , T [i]) = qi+1  .
    following operations will be executed:
                   S                                                   4. Create “long” transitions from q0 : δ(q0 , a) = {qi0 :
    (a) δ(q, a) = δN (pN , a) for pN ∈ q and for all
                                                                          a = T [i], a ≤ i ≤ |T |}∪{qi1 : a 6= T [i], 1 ≤ i ≤ |T |}.
         a ∈ A,
    (b) Q = Q ∪ δ(q, a) for all a ∈ A,                   For example of a transition diagram of a nondetermin-
    (c) state q ∈ Q will be marked,                      istic Hamming suffix automaton see Fig. 1.
    (d) continue with step 2.                                A level of a state of a nondeterministic Hamming
 4. F = {q : q ∈ Q, pN ∩ FN 6= ∅, pN ∈ q}.               suffix automaton corresponds to the number of errors,
                                                         a depth of a state of this automaton is equal to the
Using subset construction of MD equivalent to MN , corresponding position in T .
every state qD ∈ Q corresponds to some subset of QN .
This subset is called a d–subset, denoted by d(qD ). Definition 1 (Restricted approximate cover). Let
Each element of the d–subset corresponds to some T and w be strings. We say, that w is a restricted ap-
state of QN . Where no confusion arises, depth of a proximate cover of T with Hamming distance k if w is
state corresponding to an element rj ∈ d(qD ) of d– a factor of T and there exist strings s1 , s2 , . . . , sr (all
subset d(qD ) is simply denoted by rj , as numeric rep- some substrings of T ) such that:
resentation of rj corresponds to the depth. In the algo-
rithms below, d–subset is supposed to be implemented 1. DH (w, si ) ≤ k for all i where 1 ≤ i ≤ r,
as a list, preserving order of its elements. An element 2. T can be constructed by superpositions and con-
of the d–subset is denoted by ri , where the subscript        catenations of copies of the strings s1 , s2 , . . . , sr .
i means an index (order) of the element ri within the
d–subset.                                                Note 1. An approximate cover is more general regu-
    A distance is the minimum number of editing op- larity than restricted approximate cover, because (un-
erations that are necessary to convert a string x into restricted) approximate cover of T needs not be a fac-
a string y. The maximum allowed distance is denoted tor of T . In this paper, only restricted approximate
by k.                                                    cover is considered.
    The Hamming distance between strings x and y is
equal to the minimum number of editing operations Definition 2 (Restricted smallest distance ap-
replace that are necessary to convert x into y. The proximate cover). Let T and w be strings. We say,
Hamming distance function is denoted by DH .             that w is a restricted smallest distance approximate
    String w ∈ A∗ is an approximate prefix of a string cover of T with distance k if w is a restricted approx-
T ∈ A∗ with the maximum Hamming distance k if imate cover of T with the distance k and there exists
there exists string p ∈ Pref (T ) such that DH (w, p) ≤ no l < k such that w is a restricted approximate cover
k. String w is an approximate suffix of the string T if of T with the distance l.
Problem 1 (All restricted smallest distance approximate computed. When q represents an approximate prefix,
covers of a string). Given string T over alphabet A, its successors are recursively generated and processed.
Hamming distance function DH and distance k, find Note that the set of final states of the deterministic
all restricted approximate covers of T and their small- approximate cover candidate automaton is not needed
est distances. A set of all restricted smallest distance (it would contain all states having d–subsets contain-
approximate covers of string T under Hamming dis- ing element corresponding to some final state of the
tance k is denoted by coversH k (T ).                         nondeterministic Hamming suffix automaton).
                                                                  Distance l of each cover w = maxfactor (q) may
    As any approximate cover of a string T under Ham-
                                                              vary between 0 and k. Moreover, it cannot be less than
ming distance is an approximate prefix and an approx-
                                                              level of the first or the last element of d(q), because
imate suffix of T (proven in [3]), an automaton accept-
                                                              each cover must be an approximate prefix and suffix.
ing only such strings can be used.
                                                              Of course, it cannot be more than the maximum level
Definition 3 (Approximate cover candidate au- of elements of d(q). The Algorithm 1 removes all the
tomaton). An approximate cover candidate automa- elements having the maximum level but the first and
ton (Q, A, δ, q0 , F ) for string T ∈ A∗ , Hamming dis- the last element of d(q), and tries whether w covers T
tance function DH and the maximum distance k ac- without those removed positions.
cepts set W = {w1 , w2 , . . . , wl } of factors of T , where
for each wi ∈ W holds:
                                                              Algorithm 1 Smallest distance of a cover of T .
 1. there exists p ∈ Pref (T ) such that DH (p, wi ) ≤ k,
                                                              Input: d–subset d(q) representing a cover w of T .
     and                                                      Output: The smallest distance l of w.
 2. there exists s ∈ Suff (T ) such that DH (s, wi ) ≤ k.
                                                              1: lmin ← max{level (r1 ), level (r|d(q)| )}
    In [3], a construction of an automaton accepting          2: lmax ← maxr∈d(q) {level (r)}
                                                              3: l ← lmax
intersection of approximate prefixes and approximate
                                                              4: repeat
suffixes is used for construction of a deterministic ap-
                                                              5:    for all r ∈ d(q) \ {r1 , r|d(q)| } : level (r) = l do
proximate cover candidate automaton. Although this            6:      remove r from d(q)
is a straightforward idea, specialized method (more           7:    end for
effective) is presented for Hamming distance in the           8:    l ←l−1
following section.                                            9: until l ≥ lmin and for all i = 2, 3, . . . , |d(q)| : ri −ri−1 ≤
                                                                 depth (q)
                                                             10: l ← l + 1.
3    Problem solution
The principle of the solution is following: first, we per-
form a subset construction of a deterministic cover
candidate automaton from a nondeterministic Ham-             Example 1. Let us have a string T = aabccccb over
ming suffix automaton for string T and k, as every           alphabet A = {a, b, c} and let us compute a set of all
d(q) represents a set of positions of w = maxfactor (q)      restricted smallest distance approximate covers of T
within T . If we treat with d(q) as with a sorted list       under Hamming distance k = 2 using Algorithm 3.
(ordered by depths of its elements), each pair of sub-
sequent elements represents positions of subsequent              Because of the distance 2, we are interested in cov-
occurrences of w within T . When for such positions          ers of length at least 3 or having distance less than 2.
i, j, i < j holds j − i > |w|, we know that w cannot         We construct a nondeterministic Hamming suffix au-
be a cover of T . The distance of w is the minimum l         tomaton MS (see Fig. 1), then an approximate cover
such that it is possible to remove all elements r ∈ d(q)     candidate automaton M is analysed (see Fig. 2).
having level (r) > l and the previous condition holds.           Looking at the d–subset {3, 4′′ , 8′′ }, it represents
     In fact, it is not necessary to save complete de-       an approximate prefix and suffix aab of length 3, but
terministic automaton. Unlike in [3], we do not make         for its positions holds 8 − 4  3, thus the factor aab is
construction of the deterministic cover candidate au-        not an approximate cover of T with Hamming distance
tomaton and subsequent computation of covering. A            2. Looking at the other d–subset {3′′ , 5′′ , 6′ , 7′ , 8}, it
depth–first search algorithm is used to perform subset       represents factor ccb, that covers T with Hamming dis-
construction and computation of covering and of the          tance 2. It is checked whether it covers T with distance
distance of each cover: in Algorithm 2, for each state       1 (Alg. 1). As the first element of the d–subset has level
and symbol, a successor q is generated, it is deter-         equal to 2, lmin is equal to 2. The resulting set of the
mined whether it represents a cover and the distance is      covers is coversaabccccb H (2 ) = {(ccb, 2), (aabccccb, 0)}.
                     a                                          a    ′ ′′ ′′ ′′ ′′ ′′ b                                         c                   c              c            c       b
               0              123′ 4′ 5′ 6′ 7′ 8′               b 23 4 5 6 7 8                                  34′′ 8′′                45′′             56′′           67′′        7       8
                              b                                                          c                                      c                   c            ′′ c
                                                                  2′ 34′′ 5′′ 6′′ 7′′ 8′                     3′′ 45′′ 6′′ 7′′        4′′ 56′′ 7′′       5′′ 67          6′′ 7
                                                                c
                              1′ 2′ 34′ 5′ 6′ 7′ 8                     2′′ 3′ 45′ 6′ 7′ 8′′
                               c                                       2′′ 3′ 4′′ 5′ 6′ 7′ 8
                                                                b
                                                                c                                        b
                              1′ 2′ 3′ 45678′                              2′′ 3′′ 4′ 5678′                  3′′ 5′′ 6′ 7′ 8

Fig. 2. Transition diagram of complete deterministic approximate cover candidate automaton for string T = aabccccb
and the maximum Hamming distance 2



Algorithm 2 Process state of a deterministic approx-                                                                       Algorithm 3 Computation of a set of all restricted
imate cover candidate automaton M = (Q, A, δ, q0 , F )                                                                     smallest distance approximate covers for string T and
constructed for string T and the maximum distance                                                                          the Hamming distance k.
k from a nondeterministic Hamming suffix automaton                                                                         Input: String T = a1 a2 . . . an , the Hamming distance k.
MS = (QS , A, δS , q0S , FS ).                                                                                             Output: Set of all restricted smallest distance ap-
Input: State qi having depth i and the d–subset d(qi ).                                                                    proximate covers coversH k (T ) of string T using the
Output: The temporary set of restricted smallest distance                                                                  Hamming distance function DH and the distance
approximate covers c.                                                                                                      k.
 1: c ← ∅                                                                                                                   1: coversH k (T ) ← {(T, 0)}.
 2: for all a ∈ A do                                                                                                        2: Construct nondeterministic Hamming suffix automa-
 3:   create new state q, define depth(q) = depth(qi ) + 1                                                                     ton MS = (QS , A, δS , q0S , FS ) for T and k.
 4:   for all rs in d(qi ) (in order as stored in d(qi )) do                                                                3: Create state q0 of the deterministic approximate cover
 5:      append all ri ∈ δS (rs , a) to d(q) in ascending or-                                                                  candidate automaton M (T ) = (Q, A, δ, q0 , F ).
         der by depth(ri )                                                                                                  4: Define maxfactor (q0 ) = ε.
 6:   end for                                                                                                               5: Process state q0 using Algorithm 2.
 7:   if for the first r1 ∈ d(q) holds r1 ≤ depth (qi ) then                                                                6: coversH k (T ) is the resulting set from the previous step.
 8:      if exists r ∈ d(q) where level (r) = 0 within MS
         then
 9:         define w = maxfactor (q) = maxfactor (qi ).a                                                                   4        Complexities
10:         if r|d(q)| ∈ FS then
11:            if for all i = 2, 3, . . . , |d(q)| : depth(ri ) −                                                          Lemma 1. The nondeterministic Hamming suffix au-
               depth(ri−1 ) ≤ depth (q) then                                                                               tomaton MS = (Q, A, δ, q0 , F ) for string T and the
12:               define l the smallest distance of w (Alg. 1)                                                                                                           2

13:               if |w| > k or l < |w| then                                                                               distance k contains (|T | + 1) · (k + 1) − k 2+k states and
                                                                                                                                                           2
14:                  c ← c ∪ (w, l)                                                                                        |A| · (|T | · (k + 1) − 1 + k−k
                                                                                                                                                         2 ) + |T | − k + 1 transitions.
15:               end if
16:            end if                                                                                                      Proof. The automaton consists of layers of states q (i)
17:         end if                                                                                                         for each level i. The layer of states q 0 contains |T | +
18:         process state q (this algorithm), c′ is result                                                                 1 states. Each layer of states q (i) contains one state
19:         c ← c ∪ c′                                                                                                     less in comparison with layer of states q (i−1) , thus it
20:      end if                                                                                                            contains |T | − i + 1 and layer of states q (k) contains
21:   end if                                                                                                               |T | − k + 1 states.
22: end for                                                                                                                     The automaton contains |A| transitions from each
                                                                                                                           state, with some exceptions. There are k+1 final states
                                                                                                                           having no successor. In the layer of states q (k) , each
               a          a
                              a b
                                     cb
                                       c
                                           c
                                             c
                                                 c
                                                   c
                                                       c
                                                         c
                                                             b
                                                               b                                                           state has only one successor. From the initial state,
           0        1          2 3       4     5     6     7     8
                   b, c         a, c
                              b, c     a, b a, b a, b a, b a, c                                                            there are |T | transitions defined to the states q (0) hav-
                   1′
                          a ′ b ′ c ′ c ′ c ′ c ′ b ′
                            2    3      4      5     6     7    8                                                          ing level (q (0) ) = 0 and |T | · (|A| − 1) transitions to the
                           b, c a, c   a, b a, b a, b a, b     a, c
                                                                                                                           states q (1) having level (q (1) ) = 1. Thus in MS there are
                               b, c       a, c       a, b       a, b        a, b       a, b       a, c                     |Q|·|A|+|T |·|A|−(k +1)·|A|−(|T |−k +1)·(|A|−1) =
                                      b          c          c          c           c          b
                              2′′         3′′        4′′        5′′         6′′        7′′        8′′                      |A| · (|Q| − 2) + |T | − k + 1 transitions.

Fig. 1. Transition diagram of nondeterministic Hamming                                                                     Note 2. As restricted approximate covers of string T
suffix automaton for string aabccccb and the distance 2                                                                    are exact factors of T , it is meaningful to consider
                                                                                                                           effective alphabet A only, thus |A| ≤ |T | always holds.
It is also meaningless to consider large k, because every(QS , A, δS , q0S , FS ), there is no need for any additional
factor of T having length less or equal to k is always   data structures. For the purpose of the construction of
approximate cover of T . Thus k ≤ |T | always holds.     the deterministic cover candidate automaton M , only
     Usually, k ≪ |T | and |A| ≪ |T | (e.g. in DNA anal- the set of states and transitions from q0S need to be
ysis, A = {a, c, g, t}). Therefore k and |A| may be con- preserved, because the rest may be computed later in
sidered as small constants independent of |T |.          O(1) time and space using knowledge of a depth and
                                                         a level of a state, k, and T . Thus the space complexity
Lemma 2. The deterministic approximate cover can- of this construction is O((k + |A|) · |T |).
didate automaton M for string T and the Hamming              During the computation of the smallest distance
                                 2
distance contains at most |T | 2+|T | + 1 states.        (Algorithm 1), only O(1) additional data is needed.
                                                         During the processing of states of M (Algorithm 2),
Proof. Each d–subset d(q) of M contains at least one r the needed space is limited by the number of elements
such that level (r) = 0, thus maxfactor (q) ∈ Fact (T ). of all d–subsets (Lemma 4) preserved in a memory and
The number of possible factors of length depth(q) is by the number of all approximate covers (the result,
at most |T | − depth(q) + 1, thus the maximum num- limited by the number of all factors of T – at most
ber of states of M having equal depth is also |T | − O(|T |2 )).
depth(q) + 1. The automaton M also contains an ini-
tial state. Therefore, the number of states of M is at Lemma 5. Using Algorithms 2, and 3 for construc-
most (|T |−1+1)+(|T |−|T |+1)
                              · |T | + 1.                tion of a deterministic cover candidate automaton M =
                 2
                                                         (Q, A, δ, q0 , F ) from a nondeterministic Hamming suf-
Lemma 3. During the construction of the determin- fix automaton MS = (QS , A, δS , q0S , FS ), all d–subsets
istic cover candidate automaton M for string T , Al- are sorted in ascending order by depths within MS .
gorithms 2, 3 need to hold at most |T | + 2 states at a Proof. Having p, q ∈ Q\{q0 } such that q is a successor
time.                                                    of p, suppose that d(p) is sorted in order by depths
                                                            within MS . It holds that for any pS , qS ∈ QS such
Proof. Algorithm 2 works as a depth–first search al-
                                                            that qS is a successor of pS , depth(qS ) > depth(pS ).
gorithm. For each state and symbol it generates at
                                                            Therefore d(q) constructed from already sorted d(p) is
most one state – possible successor. Thus it holds at
                                                            also sorted.
most |T | + 1 states of M (|T | states having d–subsets
                                                                For p = q0 , it is supposed that δS (q0S , a) is con-
representing exact prefixes of T plus initial state) and
                                                            structed as sorted in order by depths within MS .
a state generated for a final state, having empty d–
subset.                                                     Lemma 6. Time complexity of Algorithm 1 is O(k ·
                                                            |T |) for each state.
Lemma 4. During the construction of the determin-
istic cover candidate automaton M for string T , Algo- Proof. Algorithm 1 may remove some elements of a
                                     2                    d–subset in each iteration, thus the iteration may take
rithms 2, 3 need to hold at most |T | 2+|T | + 1 elements
                                                          O(|T |) time. The number of iterations may be at most
of d–subsets at a time.
                                                          k.
Proof. Alg. 2 needs at most |T |+ 2 states in a memory      Lemma 7. Time complexity of Algorithm 2 (from the
at a time (Lemma 3). The deterministic cover candi-         initial state) is O((k + |A|) · |T |3 ).
date automaton M = (Q, A, δ, q0 , F ) is constructed
                                                            Proof. Algorithm 2 constructs for all states q and all
by subset construction from a nondeterministic Ham-
                                                            a ∈ A the d–subsets of all possible successors of q.
ming suffix automaton MS = (QS , A, δS , q0S , FS ). In
                                                            The number of states is O(|T |2 ) (Lemma 2) and the
MS , each state but q0S has at most one successor
                                                            number of elements of each d–subset is O(|T |). For
for each symbol, q0S has |T | successors for each sym-
                                                            each state, the computation of covering is performed
bol. For each state pS and its successor qS in MS
                                                            (it takes O(|T |)), and for each cover (their number is
holds: depth(qS ) > depth(pS ). The longest possible
                                                            O(T 2 )), the computation of the smallest distance is
d–subset d(p) contains r|T | having depth(r|T | ) = |T |,
                                                            performed (it takes O(k · |T |) for each cover – Lemma
and r1 having depth(r1 ) = 1. As |δS (r1 , a)| ≤ 1 and
                                                            6).
δS (r|T | , a) = ∅ for every a ∈ A, for state p and its
successor q in M holds: |d(q)| ≤ |d(p)| for p 6= q0 and     Theorem 2. Time complexity of Alg. 3 is O((k+|A|)·
|d(q)| ≤ |T | for p = q0 .                                  |T |3 ).
                                                            Proof. It clearly holds that construction of the nonde-
Theorem 1. Space complexity of Alg. 3 is O(|T |2 ).
                                                            terministic Hamming suffix automaton takes O((k +
Proof. It clearly holds that for construction of the        |A|) · |T |). Construction of the deterministic cover can-
nondeterministic Hamming suffix automaton MS =              didate automaton takes O((k + |A|) · |T |3 ) (Lemma 7).
                                                                                                                                 Athlon, for k=101 and k=201
5     Experimental results                                                                                80



The algorithm was implemented in C++ using STL,                                                           60

the program was compiled using the GNU C++ com-




                                                                                             Time [sec]
piler with O3 optimizations level. The dataset used to                                                    40

test the algorithm is the nucleotide sequence of Sac-
charomyces cerevisiae chromosome IV1 . The string T                                                       20

consists of the first |T | characters of the chromosome.
    The first set of tests was run on a AMD Athlon                                                         0
                                                                                                           100       200        300          400          500           600     700
64 3200+ (2200 MHz) system, with 2.5 GB of RAM,                                                                                           Text length

under Fedora Linux operating system (see Figs. 3, 4).
                                                                                        Fig. 5. Time consumption with respect to the text size
                                                                                        (solid line for k = 101, dotted one for k = 201)
                                        Athlon64, for k=11 and k=31
                   12

                                                                                                                                Athlon, for |T|=114 and |T|=153
                   10
                                                                                                          2.5

                   8
                                                                                                          2.0
      Time [sec]




                   6
                                                                                                          1.5


                                                                                             Time [sec]
                   4

                                                                                                          1.0
                   2


                   0                                                                                      0.5
                        0   100   200         300        400          500   600   700
                                                Text length
                                                                                                          0.0
                                                                                                                0   20     40         60     80       100         120     140   160
                                                                                                                                       Maximum distance
Fig. 3. Time consumption with respect to the text size
(solid line for k = 11, dotted one for k = 31)
                                                                                        Fig. 6. Time consumption with respect to the maximum
                                                                                        distance (solid line for |T | = 114, dotted one for |T | = 153)




                   1400
                                    Athlon64, for |T|=1162 and |T|=1550                 vs. maximum 1.0 second for text length 114 – see Fig.
                                                                                        6).
                   1200

                   1000

                                                                                        6    Conclusion and future work
      Time [sec]




                    800

                    600

                    400                                        In this paper, we have shown that an algorithm de-
                    200                                        sign based on a determinisation of a suffix automa-
           0
                                                               ton is appropriate for all restricted smallest distance
             0   200  400  600     800     1000 1200 1400 1600
                             Maximum distance                  approximate covers of a string problem for Hamming
                                                               distance. The presented algorithm is straightforward,
Fig. 4. Time consumption with respect to the distance easy to understand and to implement and its theoreti-
(solid line for |T | = 1162, dotted one for |T | = 1550)       cal and experimental time requirements are compara-
                                                               ble to the existing approach ([2]).
                                                                   The algorithm may be extended to work with other
                                                               distance functions, possibly using the idea presented
    The second set of tests was run on a AMD Athlon
                                                               in [3]. Theoretical and experimental analysis similar
(1400 MHz) system, with 1.2 GB of RAM, under Gen-
                                                               to one presented here may be accomplished. The algo-
too Linux operating system (see Figs. 5, 6).
                                                               rithm may be also extended to use parallelism.
Note 3. In comparison with experimental results pre-
sented in [2], the algorithm presented in this paper
                                                                                        Acknowledgements
runs a bit faster for the same data, even on a slightly
slower computer (1.3 seconds in [2] for text length 100
                                                                                        This research was partially supported by the Ministry
1
    The Saccharomyces cerevisiae chromosome IV dataset                                  of Education, Youth, and Sport of the Czech Repub-
    could be downloaded from http://www.genome.jp/.                                     lic under research program MSM 6840770014, by the
Czech Science Foundation as project No. 201/06/1039,
and by the Czech Technical University in Prague as
project No. CTU0803113.


References
1. Apostolico, A., Farach, M., and Iliopoulos, C. S.
   Optimal superprimitivity testing for strings. Inf. Pro-
   cess. Lett. 39, 1 (1991), 17–20.
2. Christodoulakis, M., Iliopoulos, C. S., Park, K.,
   and Sim, J. S. Implementing approximate regulari-
   ties. Mathematical and Computer Modelling 42 (Octo-
   ber 2005), 855–866.
3. Guth, O. Searching approximate covers of strings us-
   ing finite automata. In Proceedings of POSTER (2008),
   Faculty of Electrical Engineering, Czech Technical Uni-
   versity in Prague.
4. Smyth, W. F. Approximate periodicity in strings. Util-
   itas Mathematica 51 (1997), 125–135.
5. Voráček, M., and Melichar, B. Searchig for regu-
   larities in generalized strings using finite automata. In
   Proceedings of the International Conference on Numer-
   ical Analysis and Applied Mathematics (2005), WILEY
   – VCH Verlag, pp. 809–812.