-

Structured Motifs Recognition in DNA sequences

Yuridia P. Mej´ıa

Ivan Olmos

ivanoprklg@gmail.com 0

Jesus A. Gonzalez

jagonzalez@inaoep.mx 1 0 4 sur y Av. San Claudio , Ciudad Universitaria, Puebla, Me ́xico 1 Instituto Nacional de Astrof ́ısica , O ́ptica y Electro ́nica, Luis Enrique Erro No. 1, Sta. Mar ́ıa Tonantzintla, Puebla, Me ́xico

In this paper is presented a methodology for structured motifs recognition (SMR) in DNA sequences. The SMR problem consists of finding all instances of a triple-pattern PL PC PR in a DNA sequence, where PL, PC and PR are based on the IUPAC alphabet, and PL and PR are both separated from PC by a distance no greater than ”n” characters, which is provided as input. In this problem an inexact association between PL, PC , PR and the DNA sequence is allowed, which is limited by an error based on the number of insertions, deletions, and substitutions operations. In this paper we propose a methodology for finding SMR patterns based in two stages: first, an automaton is used to search all PC instances (where only substitutions are only allowed); second, a dynamic programming technique is proposed to find the PL and PR patterns (where substitutions, insertions and deletions, are allowed) based on the Levenshtein algorithm. This methodology is useful to biologists in real DNA patterns recognition tasks, where it is necessary to find DNA regions with a biological meaning.

The DNA motif search problem consists of finding a pattern P (the motif to search for) from a text T (the DNA database created from the nucleotides alphabet: ”A” for adenine, ”C” for cytosine, ”G” for guanine, and ”T” for thymine) where we output the positions in T where P appears, the instances of P in T . In this problem, P is formed from an extended alphabet defined under the IUPAC (International Union of Pure and Applied Chemistry) rule where a character in this alphabet may represent more than one nucleotide, this means that P may represent a set of patterns to search for in T .

In Biology, the motif search problem is very important, because it allows us to search or discover biological segments from an organism that is being analyzed. In general, such motifs to search for are restricted to single motifs. However, there exist biological problems where it is necessary to find motifs that are formed by two or more single motifs (structured motifs, or SM for short), which are separated by a distance defined by the biologist. In this work, we consider that a SM is formed by two or three single motifs, represented by PL PC or PL PC PR respectively. In the biological field, PC is called a ”central pattern or motif”, and PL, and PR are called satellites, mini-satellites or micro-satellites, depending of their length. In both cases, and based on biological restrictions, the SM recognition problem always start search of the PC motif, which has an important meaning for the organism that is being analyzed. Because of its importance, at the moment of searching PC in a DNA database, it is only possible to allow minimal inexact associations, limited by a threshold in the number of substitutions of characters considering the IUPAC alphabet. The number of such substitutions (the threshold) is defined by the biologist. On the other hand, at the moment of searching a satellite, more flexibility is allowed. This means that a satellite could be found after a number of insertions, substitutions, and deletions in the DNA database has been applied (considering a larger threshold).

The problem addressed in this paper consists of finding motifs with the structure PL PC PR, where PL is located at the left of PC and PR at the right of PC , and both of them are located at a distance (#bases) no longer than d1 and d2 respectively, which are provided as input parameters. As a solution to this problem, we propose a two phases methodology: at the beginning we build an automaton that searches all instances of PC in a DNA database, reporting their positions; after that, we propose a dynamic programming strategy based on the Levenshtein algorithm, which search possible PL and PR satellites in the DNA database restricted to a distance d1 and d2 respectively.

The paper has the following structure: in section 2 we introduce important notation used in this work. In section 3 we present the problem to be addressed. Our proposal for finding the structured motifs is described in section 4. Finally, conclusions and future work is presented in section 5. 2

Definitions and notation

In this section, we introduce the notation on strings, sequences, and automata, that we use along this work.

An alphabet, denoted by is a finite nonempty set of letters. A string on alphabet is a finite subset of elements on , one letter after another. As an example, if = fA; C; G; T g then, examples of strings on this alphabet are AAC,ACC, and AAG. We denote by " the zero letter sequence, called the empty string. The length of a string x is defined as the number of letters on x, denoted by jxj. With x[i], i = 1; : : : ; jxj, we denote the letter at position i on x. A substring of a string is a string where: j j j j, and [i] = [i + k], i = 1; : : : ; j j and 0 k j j j j.

The concatenation of two strings x and y is the string composed of the letters of x followed by the letters of y, denoted by xy. A string is a prefix of the string x, denoted by @ x, if x = y. On the other hand, is a su x of x, denoted by A x, if x = y . The empty string " is both a prefix and a su x of every string (not empty).

In this work we denote a finite automaton M as a 5-tuple (Q; q0; A; ; ) where: Q is a finite set of states, q0 2 Q is the start state, A Q is a distinguished set of accepting states, is a finite input alphabet, and : Q x ! Q is the transition function of M.

Let P be a pattern (string) to search for. In biochemistry, P is a string called a motif, wherein all of its letters are defined by the union of two alphabets: a main alphabet

B = fA; C; G; T g (every letter represents a nuceotide as described before), and the IUPAC alphabet, (denoted as the extended alphabet in this work) used to represent ambiguities in the pattern. The IUPAC established an ambiguity alphabet, represented by E = fR; Y; K; M; S ; W; B; D; H; V; Ng, where R = fG; Ag, Y = fT; Cg, K = fG; T g, M = fA; Cg, S = fG; Cg, W = fA; T g, B = fG; T; Cg, D = fG; A; T g, H = fA; C; T g, V = fG; C; Ag, and N = fA; G; C; T g.

Let X be a string, where X = AW B, A and B are substring of X separated by the string W. The distance between two strings A and B is denoted by d(A;B), where d(A;B) = jWj.

A match between two strings A and B is a process where each character of A is associated with a character of B. In this paper, we define three di erent associations (matchings) between two strings: exact matching, exact set matching, and inexact matching (for the sake of simplicity, consider the strings X; Y, and Z, where X and Y are defined in B, and Z is defined in B S E): T CAAG and Y = T CAG (jXj > jY j), it is necessary to remove a character from X. If X = , = T C, = A and = AG, then it is possible to remove from X, generating a new string X0, such that X0 = Y .

It is usual that the total number of insertions / deletions / substitutions in the inexact matching process is limited to a percentage (permissible error) defined by the user. In this work, we define the error of an inexact matching between two strings X and Y , denoted by (X;Y), as follows: (X;Y) = (#insY + #subY + #delY )=jXj, where Y is a string derived from X through insertions (#insY ), substitutions (#subY ) and deletions (#delY ).

Finally, we introduce the concept of a structured motif. This term is introduced in this work because it is based on the composition of simple motifs. Formally, a structured motif is a string X = PLW PC Z PR, where jWj = d1, jZj = d2 (this means that PL and PR are located at a distance from PC no longer than d1 and d2 respectively), PL is the left satellite, PC is the central pattern, PR is the right satellite, and PL, PC and, PR are defined in B S E .

As an example, consider the string X = A G T G A C G A C T C A, where PC = ACG, PL = T G, PR = T C and d1 = 3 and d2 = 4. Then, in X we find a PC in position 5 - 7, a PL at position 3 (with d1 = 0) and PR at position 10 (with d2 = 2).

In Section 3, we introduce a detailed description of our problem. 3

Problem Description

In this section we describe in detail the input and output of the SMR problem that is considered in this work.

First, we describe the input of the problem. Let T be a DNA database, PL, PC , and PR three motifs to search for, where PC is a motif called ”central pattern”, PL and PR called satellites (left and right respectively), two distances d1 (the distance between PL and PC ) and d2 (the distance between PC and PR), and a percentage error (#mismatches allow in a matching process between a substring of T and a satellite). Based on these input parameters, the output of the SMR problem consists of finding all substrings S of T , such that: 1. There exists a substring S of T such that S PC , and 2. There exists a left substring S 0 of T at a distance d1 of S such that S 0 3. There exists a right substring S 00 of T at a distance d2 of S such that S 00 4. S 0 and S 00 must both be generated within an error , ie, (PL;S 0) (PR;S 00) .

PI , and

PD, and and

It is important to mention that this problem has some variations based on the input parameters. These variants are described bellow: – If the input of the SMR problem is: T , PL, PC , d1 and , then we need to find all instances S and S 0 from T such that S PC , S 0 PI , d(S ; S 0) d1 and (PL;S 0) .

This problem is called PL PC . – If the input of the SMR problem is: T , PC , PR, d2 and , then we need to find all instances S and S 0 from T such that S PC , S 0 PR, d(S ; S 0) d1 and (PR;S 0) .

This problem is called PC PR. – If the input of the SMR problem is: T , PL, PC, PR, d1, d2 and , then we need to find all instances S , S 0 and S 00 from T such that S PC, S 0 PL, S 00 PR, d(S ; S 0) d1, d(S ; S 00) d2, (PL;S 0) , and (PR;S 00) . This problem is called PL

For the sake of simplicity, in this paper we describe the solution of the PL - PC problem. Note that in order tho solve the other two versions of the problem, we only need to change the orientation of the distances, or combine the results of PL - PC and PC - PR. In the next section we introduce a proposal to solve the PL - PC problem. 4

Proposal

With the aim to solve the PL - PC problem, we propose a method divided in two phases: – We first implement a searching phase, where all substrings S of T , S PC are located. As result of this phase, we obtain a set C = fS : S is a substring of T , and S PCg. To do this, we propose an automaton that searches for all instances of PC in T . At the end of this phase, all results are stored in a matrix, including the end position of each pattern. – In a second phase, all S 2 C are processed, searching if there exists a substring S 0 of T (satellite), such that S 0 PL and d(S 0; S ) d1. Since insertions, deletions, and substitution operations are needed in this phase, we propose a technique based on a dynamic programming strategy. As output, this phase reports all central patterns with their left-satellites.

We proceed to explain in detail the way in which each phase is implemented. 4.1

Searching the Central Patterns As we mentioned above, we implement an automaton to search for all the instances of PC in T . This automaton, called MFA [ 5 ] is based on the idea of storing in a temporal memory each pattern that has previously been recognized during the searching phase [ 2 ]. In other words, this automaton implements a strategy that stores knowledge about how the pattern matches overlap with itself, avoiding the computation of the prefix that matches with the last largest su x. As result, in the searching phase, each character in T is examined exactly once (linear time with respect to jT j). In our explanation and for the sake of simplicity, consider that PC = AMS , PL = GK and a distance d1 = 4.

The MFA automaton must be constructed from the pattern PC in a preprocessing step before it can be used to search PC in T . This preprocessing phase is divided in three stages: first, we perform a phase called expansion of PC, where all characters from PC in E are substituted by characters in B (based on the IUPAC nomenclature); then, in a second step we build a matrix that stores the states of the MFA automaton; finally, in a third phase we generate the transition matrix of the MFA automaton.

The expansion of PC is performed by a substitution process, where each character of PC in E is ”expanded” with each valid character in B, restricted to the combinations derived from the IUPAC specification. As result, we generate a set called setP, where we store each combination of the pattern created with the substitution of characters. As an example, if PC = AMS , where M = fA; Cg, S = fC; Gg, then M and S are replaced by characters in B, from right to left. As final result, seqP = fAAC; ACC; AAG; ACGg.

Based on seqP, the second phase consists on building a matrix called matQ, which stores all states of the MFA automaton. With the aim to identify with precision all the final states, this matrix is filled sequentially, from top to bottom (by rows), and from left to right (by columns), where the number of columns is equal to jPC j (each column corresponds to a character). The values assigned to the matrix start from 1, and with unit increments. Each column is assigned a number of values that is equal to the product of each character cardinality associated to the previous columns, including the current column. The last column of this matrix stores all the final automaton states. Resuming to our example, matQ for PC = AMS is filled as shown in Fig. 1.

As final step of the preprocessing phase, we generate the transition matrix of the MFA automaton. This matrix is filled row by row, its values depend of the largest su x that is prefix of some element of seqP (for further reference, see [ 5 ]).

For example, consider vertex 2 of Fig. 2 with label ”AA”, which is obtained by concatenating the letters in the path from the root to this vertex. From AA, we derive a new expansion by adding at the end each character in B, obtaining the set of strings AAA, AAC, AAT, and AAG. We then test these strings, searching their largest su x that is a prefix for some element in seqP. In this example, AAC and AAG satisfy this condition, since they are su xes of the first and third element of seqP respectively.

As next step, we compute states that are reached from vertex 2 through transitions with C and G. Each result is stored in [i; j], where i is the ith element of seqP associated to the prefix, and j is the jth element of B (used in the respective transition). Resuming to our example, from vertex 2 with transition throughout ”C” (third element of B), we obtain the string AAC, which is su x of the first element of seqP. Therefore, [2; 2] = matQ[k; l], where k is the kth element of seqP associated to the prefix, and l is the cardinality of this prefix. In this example, k = 3 (AAC is a sufix of the first element of seqP), and l = 3 (jAACj = 3). As result of this, from vertex 2 with label AA, there exists a transition with C towards vertex 4. This process is used to compute each transition for each state of the automaton.

Before we continue and for the sake of simplicity in our explanation, we enumerate in our example the rows of matQ starting from 1, and the rows of starting from 0.

At the end of the preprocessing phase, the MFA automaton obtained from our example is shown in Fig. 3.

After that we build , the next step consists of search of each pattern from setP in T . However, this phase is simple because is computed in linear time T , character by character. Consider that our results are stored in a matrix called PCS npc;2, where npc is the total number of patterns located in T . For simplicity, consider that the first column of PCS stores each pattern, and the second column the position of the last character of the pattern in T .

As an example, if T = AT GGACAACC and is the automaton shown in Fig. 3, we obtain the output illustrated in Fig 4. In this example, the circles are used to mark the final states, which represent patterns found in the searching phase. In this example, the found patterns are AAC and ACC. These results are stored in the PCS matrix illustrated in Fig. 5. The next step in our methodology consists of a process where based on a string S in PCS , we search for substrings S 0 of T where S 0 PL, and d(S 0; S ) d1.

Since PL, as the central patterns, may contain characters in E, we first replace these characters by characters in B (in this phase we could use the same substitution process explained before). Let be a set that stores each pattern generated from S .

Let PL = GK be the pattern to search for. After we replace each extended character from PL, our output is = fGG; GT g. We then search for each element of this set in a segment of T , limited by the position of the central pattern and a left distance no larger than d1. Resuming to our example where T = AT GGACAACC, PC = AMS and d1 = 4, we have two possible central patterns, AAC and ACC. If we first process AAC as PC and GG as PL, then we need to find all possible alignments as we show in Fig. 6.

As we mentioned in section 4, it is possible to use insertions, deletions, and substitutions in the searching phase of PL. The total number of these operations is limited by , which is the error allowed by the user (this value is defined as a percentage with respect to the dimension of PL).

Since this problem is related to the string alignment problem [ 2 ], we propose a solution based on a dynamic programming strategy with the Levenshtein distance [ 1 ], which determines the total number of insertions, deletions and substitutions that we need to transform a string (original pattern) into other string (target pattern).

As we see in Fig. 6, the first step consists of defining a start and end position where we will search for PL in T . This gap is called a search window, and there is a window search per each pattern in PCS . The limits of a search window are computed from the initial position of each PC in PCS , where: Posstart = Posend jPCj + 1 (Posend value is retrieved from PCS ). Based on our example, where PCS includes the central patterns AAC and ACC, we obtain the following initial positions: – AAC: Posstart = Posend – ACC: Posstart = Posend jAACj + 1 = 9 jAACj + 1 = 10

Based on Posstart, we compute the initial and final positions of each search window of each PL in , where: Wini = Posstart d1 jPLj, and Wend = Posstart 1. In our example, where PL = GK, = fGG; GT g and d1 = 4 we compute these positions per each PC : – AAC: Wini = Posstart – ACC: Wini = Posstart

Note that if we do not implement an intelligent strategy to search for possible alignments, we need to test a character more than once if there exist intersections between two or more search windows. In order to avoid this computation, we compute the intersection between two search windows. This value is stored in a variable called Mem, where

Mem = ( Wepnrdevious 0

Wsctuarrrtent if Wepnrdevious > Wsctuarrrtent

otherwise

Wepnrdevious is the end position of the previous search window and Wsctuarrrtent is the start position of the current search window. For our example of Fig. 7, we compute Mem per each central pattern as follows: – AAC: Wepnrdevious – ACC: Wepnrdevious

Wsctuarrrtent = 0 Wsctuarrrtent = 6 1, then Mem = 1 2, then Mem = 4

The next step consists of building a matrix that stores the total number of insertions, deletions, and substitutions which are needed to align PL with a segment of characters into the current search window. This matrix, denoted by D(m+1) (n+1), has m + 1 rows, where m = jS 0j (S 0 is an element of ), and n + 1 is the total number of columns (where n is the cardinality of the search window associated to the current central pattern). This matrix is filled with the Levenshtein algorithm [ 1 ]: – D[0; j] = 0, where 0 – D[i; 0] = i, where 0 – i j

n m D[i; j] = (

D[i 1; j 1] if k;i = T [ j]

D[i 1; j 1] + 1 otherwise where k;i is the ith character of the kth element of , and T [ j] is the jth character of the search window that is currently being processed. These rules are used to fill each D for each element in . When we changed of search window, we compute Mem to know if there exist columns in D that are a copy of the matrix of the previous search window (these columns are a copy starting from the second column of the current matrix D).

Resuming to our example, consider the case of ACC with GG as PL with Mem = 4. Therefore, the current matrix D is filled as follows: the first column is computed based on the Levenshtein algorithm; the next four columns are copied from the matrix of AAC with GG as PL; the last column is also filled based on the Levenshtein algorithm. This example is shown in Fig. 8.

Finally, these matrices are used to find possible alignments based on the error . The procedure is simple: we select all D[jPLj; j], where D[jPLj; j] , and j is the final position of a valid left pattern of the corresponding search window. As an example, if we consider that = 50%, then we need to find all positions where D[jPLj; j] 1, because jPLj = 2. In our example of Fig. 8, there exist three valid results for AAC and GG as left pattern: D1[2; 2] = 1, D1[2; 3] = 0, D1[2; 4] = 1; and three valid results for ACC and GG as left pattern: D4[2; 1] = 1, D4[2; 2] = 0, and D4[2; 3] = 1. With this information, we retrieve the left patterns based on T = AT GGACAACC: – Since the search window for PC = AAC and GG as left pattern is AT GGAC, then our left patterns are located at the end positions: 2 (TG is transformed in to GG with one substitution), 3 (corresponding for the pattern GG), 4 (GA, is transformed to GG with one substitution) – Since the search window for PC = ACC and GG as left pattern is T GGACA, then all patterns for this search window are the same as the previous window (they are not computed again).

As we can see in this example, our proposal has the following advantages: we can find the central patterns very fast because we avoid computing any character in T twice. With a dynamic programming strategy based on the Levenshtein algorithm, it is possible to improve the runtime taken to find the left patterns, we avoid computing common regions twice. However, it is important to see that this approach needs to generate a matrix per each possible left pattern. 5

Conclusions and Future Work

This paper presents a proposal for solving the structure motive recognition in a DNA sequence. Our approach is divided in two phases: first, we locate all central patterns with an automation, that performs this task very fast; second, we implement a dynamic programming strategy and the Levenshtein algorithm with the aim to find satellites that are located at a distance no larger than d. Actually, we implemented and tested the automaton and found that its performance is very fast, allowing us to work with large DNA databases.

We are currently implementing the dynamic programming strategy with the Levenshtein algorithm proposed in this work. After testing this phase, we will develop a tool that can be used by biologists that need to find structured patterns.

1. Bofivoj , Melichar. Approximate string matching by finite automata . In Conf. on Analysis of Images and Patterns, number 970 in LNCS. Pages 342 - 349 . 1995 .

2. Cormen , Thomas H. Introduction to Algorithms . 2nd Edition . The MIT Press and McGrawHill. Pages 805-811 , 2001 .

Maxime

Crochemore and Marie-France

Sagot

, Motifs in Sequences: Localization and Extraction. Pages 26-31 , 2000

4. Navarro , Gonzalo. A Guide Tour to Approximate String Matching . ACM Computing Surveys . vol. 33 , Issue 1. Pages. 31 - 88 , 2001 .

5. Perez , Gerardo et al. An Automaton for Motifs Recognition in DNA Sequences . To appear in MICAI 2009 , Springer - Verlag.