=Paper=
{{Paper
|id=None
|storemode=property
|title=Aligning sequences with repetitive motifs
|pdfUrl=https://ceur-ws.org/Vol-990/paper7.pdf
|volume=Vol-990
|dblpUrl=https://dblp.org/rec/conf/itat/KovacBV12
}}
==Aligning sequences with repetitive motifs==
Aligning sequences with repetitive motifs? Peter Kováč, Broňa Brejová, and Tomáš Vinař Faculty of Mathematics, Physics, and Informatics, Comenius University in Bratislava, Mlynská dolina, 842 48 Bratislava, Slovakia kovac.peter@fotopriestor.sk, {vinar,brejova}@fmph.uniba.sk Abstract. Pairwise sequence alignment is among the most intensively studied problems in computational biology. F We present a method for alignment of two sequences con- L taining repetitive motifs. This is motivated by biological studies of proteins with zinc finger domain, an important C H group of regulatory proteins. Due to their evolutionary his- Zn tory, sequences of these proteins contain a variable number C H of different zinc fingers (short subsequences with specific H T symbols at each position). T G E K P YF G E K P Our algorithm uses two types of hidden Markov models (HMM): pair HMMs and profile HMMs. Profile HMMs Fig. 1. The structure of a zinc-finger. Highly variable sites describe the structure of sequence motifs. Pair HMMs as- are marked with black color. The most conserved amino sign a probability to alignment of two motifs. Combination acids are the four involved in binding the zinc ion [14]. of the these two types of models yields an algorithm that uses different score when aligning conserved vs. variable motif residues. The dynamic programming algorithm that computes the motif alignments is based on the well known positions are very conserved due to their importance Viterbi algorithm. We evaluated our model on sequences of in assuming desired function, while other positions are zinc finger proteins and compared it with existing alterna- highly variable, since they distinguish specific DNA se- tives. quences where individual zinc fingers bind (Figure 1). We will focus our attention on the KRAB-ZNF 1 Introduction proteins that have a region encoding one or more Krüppel-associated box domains (KRAB, [2]) followed Pairwise sequence alignment is one of the most stud- by a zinc finger region (Fig. 2). The human genome en- ied problems in bioinformatics. We will concentrate on codes more that 600 of proteins from this family, and alignment of protein sequences, where a protein can a lot of effort is dedicated to building and maintaining be represented as a string over the alphabet of 20 dif- their catalogues [9], [11], [3]. Complicated repetitive ferent amino acids. During the evolution, particular structure of these genes is a result of a dynamic evo- amino acids in a protein can be substituted by an- lutionary history, full of sequence duplications [7, 12], other amino acid, or even get inserted or deleted. The and many mutations which help to gain new functions goal of sequence alignment is to compare two proteins, for duplicated copies. quantify their sequence similarity, and to identify pairs The repetitive nature of zinc finger protein of amino acids that have likely evolved from the same sequences complicates their sequence alignment. Tra- amino acid in the common ancestor. Over the years, ditional alignment methods based purely on sequence multitude of variations of this problem have been in- similarity frequently misalign individual zinc fingers, troduced and many practical software tools were de- or even align a single zinc finger in one sequence to veloped. parts of several different zinc fingers in the other se- Our work is motivated by the study of zinc finger quence. Consequently, many studies of these proteins proteins. These proteins contain a variable number of limit their analyses and infer conclusions based only up to 40 zinc finger domains [18]. Zinc finger domain is on the the KRAB domains or sequences before the zinc a stretch of approximately 28 amino acids, the purpose finger region (e.g. [14], [7]), or dispute the relevance of of which is to bind DNA at specific places. Comparison standard methods applied to genes with high variance of zinc fingers form different proteins reveals that some in the number of fingers [16]. ? This work was supported by the European Com- In this work, we develop a new method for aligning munity FP7 Marie Curie grants IRG-224885 to TV sequences with repetitive motifs, such as zinc finger and IRG-231025 to BB, and by a grant from VEGA proteins. To overcome the problems outlined above, we 1/1085/12. combine the strength of profile hidden Markov models 42 Peter Kováč et al. KRAB A align them by inserting dashes to individual sequences so that they all have the same length and when we ar- KRAB A KRAB B range them in a table, as in Figure 3, many columns contain the same or similar amino acids. Several con- Fig. 2. Domain structure of typical KRAB-ZNF genes. secutive dashes form a gap in the alignment, indicat- The protein contains one or more KRAB domains and an ing that a part of the sequence was deleted or inserted array of 3 to cca. 40 zinc fingers. [18] during the evolution. The sequence alignment prob- lem can be formulated as an optimization problem and solved by existing algorithms. For two sequences, the which are used to characterize the properties of these problem can be solved easily by Needleman-Wunsch repetitive motifs, and pair hidden Markov models as dynamic programming algorithm [10], for multiple se- a model of sequence alignments. We compare our work quences it is NP-hard [6]. The scoring function for to MotifAligner that was previously used to align zinc pairwise alignment is typically based on a substitution finger proteins [11], and we find our new method to matrix scoring all pairs of aligned amino acids and on produce more accurate alignments on our testing set. parameters for scoring gaps: gap opening penalty g for In the rest of this section, we introduce nec- the first dash in a gap and gap extension penalty e for essary background and notation and describe the each additional gap. MotifAligner approach to repetitive sequence align- ment in more detail. In Section 2, we describe our new profile-profile-pair alignment method (PPP). We ZNF626_4799/12 YKC--EECGKAF-NQSSILTTHERIILERN- present the results of experimental comparison of PPP ZNF727_4861/2 YKC--EECGKDC--RLSDFTIQKRIHTADRS and MotifAligner in Section 3. ZXDB_644/5 YQCAFSGCKKTF-ITVSALFSHNRAHFREQE LLNL1236_4814/2 SMC--PECSKTSATDSSCLLMHQRSHTGKRP ZNF23_141/15 FQC--KECGKAF-HVNAHLIRHQRSHTGEKP 1.1 Background and notation In this paper, we rely on several standard tools from Fig. 3. Alignment of five sequences of zinc finger motifs computational biology, namely alignments, pair hid- from human proteins. den Markov models, and profile hidden Markov mod- els, which we briefly explain in this section. We start by defining hidden Markov models One way of systematically deriving a scoring func- (HMMs). An HMM is a probabilistic finite state au- tion for pairwise alignments is to use pair HMMs [4]. tomaton. We can use it to generate a random sequence These models emit two sequences simultaneously. In over some alphabet as follows. We start in a desig- one step, the HMM can emit a single character in one nated start state B. In each step, we sample a charac- of the sequences or in both. The later case corresponds ter of the sequence from the emission probability dis- to two symbols aligned to each other, the former to tribution associated with the current state and then a symbol aligned to a dash. Figure 4 shows the pair randomly change the state according to the transition HMM used in our work. The match state M emits probability distribution. The process ends when we pairs of aligned characters, state X emits characters reach the designated final state E. only in the first sequence, and state Y emits charac- The sequence of states visited in the individual ters only in the second sequence. Given two sequences, steps is called a state path. We will denote the proba- we can find the most probable state path that could bility of emitting x in state v as ev (x) and the proba- generate them and this will give us an alignment of bility of transition from state v to w as tv,w . The joint these two sequences. probability of emitting a sequence x = x1 . . . xn along To represent a typical sequence of a motif, we will the state path s = s1 . . . sn in a given HMM is use another kind of HMMs, called profile HMMs [4]. A profile HMM is typically constructed based on an n Y alignment of several motif instances, such as the one P (x, s) = es1 (x1 ) tsi−1 si esi (xi ). in Figure 3. Each position of the motif is represented i=2 by one state with emission probabilities set to the ob- A typical task solved with HMMs is to find the most served frequencies of amino acids in the corresponding probable state path that could generate a given se- alignment column (possibly with some pseudocounts quence, i.e. to find s∗ = arg maxs P (x, s). This task added to avoid zero probabilities). These so called is solved by the Viterbi algorithm based on dynamic match states are arranged in a chain (see Figure 5). programming [19]. These states used alone would generate sequences of The second important notion is sequence align- the same length. However, real sequences may have ment. Given a set of related protein sequences, we can various insertions and deletions compared to the con- Aligning sequences with repetitive motifs 43 ε HMMER in the original input sequences x and y, re- δ spectively. In the second step, MotifAligner computes X q τ scores of all gapless pairwise alignments of motifs 1-2δ-τ xi tk , u` , for all 1 ≤ k ≤ a, 1 ≤ ` ≤ b: δ 1-ε-τ 1-2δ-τ L M τ E X B p xiyj s[tk , u` ] = S[tki , u`i ], (1) i=1 δ 1-ε-τ τ where S[xi , yj ] is the score of aligning amino acids xi Y q yj and yj (they use a standard BLOSUM85 substitution δ matrix [8]; motif occurrences are padded to have the ε same length). τ In this way we obtain a similarity score between each pair of motif occurrences. Next MotifAligner ap- Fig. 4. A pair HMM for global alignment. Transition prob- plies the Needleman-Wunsch algorithm [10] to T abilities are defined by three parameters δ, ε, τ , emission and U , treating motifs as sequence symbols and using probabilities by matrices p and q. matrix s as the substitution matrix. In this way we obtain pairs of aligned zinc fingers between the two proteins. 2 Profile-profile-pair alignment In this section, we present a new approach to align- ment of sequences with repetitive motifs. We adopt an approach similar to MotifAligner, however, we change Fig. 5. Example of a profile HMM. States Mk are match the alignment algorithm and the scoring scheme to states, Ik are insert states and Dk are delete states. States take into account the structure of the repeated motif. B, E, and D1 . . . D3 are silent, which means that they do For example, the zinc-finger motif (Fig. 2) contains not generate any characters. several highly conserved positions, among them the four amino-acids binding the zinc ion (positions 3, 7, 20, 24). These four amino acids are crucial to the func- sensus motif; these are modeled by additional insert tion of the motif and as such should be used to anchor and delete states. Given a profile HMM and a se- the whole alignment. However, the fact that these po- quence, we can again find the most probable state sitions match in the two aligned sequences should not path, which in this case gives us an alignment of the be very surprising and should not by itself contribute sequence to the motif represented by the profile HMM. much to the resulting score. On the other hand, there Note, however, that the profile HMM emits only a sin- are several variable positions, and the differences at gle sequence; the motif itself is represented directly in these positions will be very informative of the evolu- the structure and parameters of the model. tionary distance. To take these issues into account, we have devel- 1.2 MotifAligner approach oped a new profile-profile-pair method (PPP) for To obtain high quality alignments even on sequences with highly variable number of zinc finger motifs, Now- ick et al. developed a pairwise alignment tool called MotifAligner [11]. To our knowledge, it is the only se- quence alignment method designed specifically to align sequences with variable number of repetitive motifs. Part of our work was inspired by this algorithm. MotifAligner first uses a profile HMM tool HMMER [5] and finds all canonical motif occurrences with statistically significant scores in both input se- quences. Let T = (t1 , . . . , ta ) and U = (u1 , . . . , ub ) Fig. 6. The profile HMM of random 2000 human zinc fin- be the sequences of all motif occurrences found by gers from the complete dataset, viewed as a HMM logo [15]. 44 Peter Kováč et al. alignment of individual motifs. The method uses Thus our goal is to compute three paths s∗p , s∗x , s∗y a combination of two profile HMMs and a pair HMM through the pair HMM and the two profile HMMs for sequence alignment and aligns the two sequences that would satisfy our constraints and the product of by finding the best possible path through all three joint probabilities implied by all three models would models simultaneously. To align the complete protein be maximized: sequences containing these repeating motifs, we first align each possible pair of motifs through PPP, com- (s∗p , s∗x , s∗y ) = arg max score(x, y, sp , sx , sy ), (2) pute their similarity score, and use a modification of valid sp ,sx ,sy a traditional global alignment algorithm, now oper- ating on individual motif occurrences as a unit. We where score(x, y, sp , sx , sy ) = Ppair (x, y, sp )· describe the details of the method in the remainder of Pprofile (x, sx ) · Pprofile (y, sy ), where Ppair (x, y, s) is the this section. joint probability of the state path s aligning sequences x and y in the pair HMM and Pprofile (x, s) is 2.1 Pairwise alignment of individual motifs the joint probability of the state path s and sequence x in the profile HMM. The input to PPP consists of two instances of the re- We obtain an optimal solution using dynamic pro- peating motif x = x1 . . . xLx and y = y1 . . . yLy , a pro- gramming similar to the Viterbi algorithm used to file HMM encoding the same motif, and a pair HMM compute the most probable state paths in individual characterizing the properties of a typical alignment. HMMs. Let S = (Sp , Sx , Sy ) be the triplet of states Our goal is to align both x and y to a separate copy of of pair and x-profile and y-profile models satisfying profile HMM and at the same time, use the pair HMM our conditions. We denote V [Sp , Sx , Sy , i, j] the score as a glue. of the highest scoring state path combination ending In particular, we are simultaneously seeking the with the triplet S and covering the prefixes x1 . . . xi , three paths through the three HMMs that satisfy the y1 . . . yj of the two sequences. following constraints: The computation of V [Sp , Sx , Sy , i, j] depends on the types of states Sp , Sx , Sy . For example, if Sp is the Constraint 1 (Profile match states constraint) match state M of the pair HMM and Sx is the match If xi and yj are emitted by the same match state Mk state Mk of the profile HMM, then according to our in their profile models then the pair model has emit xi constraints Sy must be the same match state Mk and and yj together in the match state M . we have the following recurrence: Constraint 2 (Pair match state constraint) If the pair model emits xi and yj together in the match V [M, Mk , Mk , i, j] = eM (xi , yj )eMk (xi )eMk (yj )· state M then both profile models emit xi and yj in the tM M tM` Mk tM` Mk · V [M, M` , M` , i − 1, j − 1] same match state Mk or in the same insert state Ik . for 0 ≤ ` < k t tIk−1 Mk tIk−1 Mk ·V [M, Ik−1 , Ik−1 ,i−1, j −1] M M In other words, if the pair model is in the state X t t M` Mk tMn Mk · V [X, M` , Mn , i − 1, j − 1] XM or Y (which is interpreted as a gap in one of the se- for 0 ≤ `, n < k, n 6= ` quences), the two profile models should not be in the tXM tM` Mk tIk−1 Mk · V [X, M` , Ik−1 , i−1, j −1] same match state: symbols belonging to the same con- for 0 ≤ ` < k sensus column should be aligned. However, if both tXM tIk−1 Mk tM` Mk · V [X, Ik−1 , M` , i−1, j −1] profile models are in the same insert state they can max for 0 ≤ ` < k either be evolutionarily related, in which case they tXM tIk−1 Mk tIk−1 Mk ·V [X, Ik−1 ,Ik−1 , i−1, j −1] should be aligned using M state of the pair model, tY M tM` Mk tMn Mk · V [Y, M` , Mn , i − 1, j − 1] or they could have been inserted in the sequence inde- for 0 ≤ `, n < k, n 6= ` pendently, which would correspond to using X and Y tY M tM` Mk tIk−1 Mk · V [Y, M` , Ik−1 , i − 1, j − 1] states of the pair model. Constraint 2 also implies, that for 0 ≤ ` < k if the profile models are neither in the same match tY M tIk−1 Mk tM` Mk · V [Y, Ik−1 , M` , i−1, j −1] state Mk nor in the same insert state Ik (i.e. either for 0 ≤ ` < k are in completely different columns or in the same col- tY M tIk−1 Mk tIk−1 Mk ·V [Y, Ik−1 ,Ik−1 , i−1, j −1] umn k, but different states Mk and Ik ), which means that the symbols being emitted are unrelated, then The value at V [M, Mk , Mk , i, j] has to include the the pair model should not be in the match state. These emission probabilities of xi and yj in all three models. constraints thus ensure that the sequence and the pro- Then we take a maximum over all choices of previous file alignment can be interpreted in a consistent man- cells from which the current cell value could be com- ner. puted. Every value considered in the maximum is the Aligning sequences with repetitive motifs 45 product of the value of the predecessor cell and tran- Number of Finger Motifs sition probabilities in all three models. All the other Genome Genes Variants Total Average Median cases can be derived analogously; we omit the deriva- hg19 612 1071 13363 12.48 12 tions due to the space constraints. mm9 302 513 5226 10.19 10 canFam2 477 828 9259 11.18 11 Every time we compute a value for any cell, we keep rheMac2 578 1010 12143 12.02 12 a pointer to the cell from which the value was derived. We use those pointers later to trace back the resulting Table 1. The complete dataset, based on genes from the state paths. Of particular importance is the path in the whole human genome. One gene can have multiple variants pair model, since it defines the alignment of x and y. that differ in organization of zinc fingers. For each of the resulting state path, we also compute its joint probability its respective model, obtaining val- ues Ppair (x, y, s∗p ), Pprofile (x, s∗x ), and Pprofile (y, s∗y ). and the length of motifs is almost the same, so we can say that Lx , Ly , L = O(m) and hence the time re- 2.2 Alignment of complete motif arrays quired to compute the alignment of one motif pair is 6 We use the same procedure as MotifAligner for align- O(m ). From the same observation, one can easily see 2 4 ment of complete motif arrays. We compute all pair- that the space complexity is O(n + m ). The running wise alignments of individual motifs, where the score time and memory is practical, since values of n and m of a pairwise motif alignment is based on joint proba- tend to be small in real proteins (for zinc-finger arrays, bilities of motif sequences and state paths in all three both n and m are less than 30). models as described below. Since we perform the mo- tif alignment for all pairs of motifs and assign a score to each such alignment, we get a scoring system simi- 3 Experiments and evaluation lar to a scoring matrix. Treating motifs as symbols and using this scoring matrix, we obtain the full alignment Gold standard data set. We evaluated our approach of input motif arrays using Needleman-Wunsch algo- on human zinc-finger genes and their counterparts in rithm. related species macaque, mouse, and dog. We down- More formally, for motif arrays Ax = (x1 , . . . , xn ) loaded the set of annotations of KRAB zinc finger and Ay = (y1 , . . . ym ), we calculate n × m matrix S, genes from the Human KZNF Catalog [9] and where remapped the annotation to the current hu- Ppair (xi , yj , s∗p,i,j ) man genome assembly hg19 using liftOver tool. To S(xi , yj ) = ln , (3) obtain the sequences of these zinc finger genes in other Pprofile (xi , s∗x,i,j )Pprofile (yj , s∗y,i,j ) species, we used the whole genome alignments from where sp,i,j , sx,i,j and sy,i,j are the three state paths the UCSC genome browser [17] as a mapping between ∗ ∗ ∗ computed when aligning motifs xi , yj by PPP. This the human (hg19) and the macaque (rheMac2), score compares the hypothesis that the two motif se- mouse (mm9), and dog (canFam2) genomes. quences are related (given by probability from the pair The resulting genomic sequences were translated HMM) to the hypothesis that these are simply two into amino acid sequences and cleaned for apparent independent sequences following the same profile (as artifacts. In particular, we removed genes that con- determined by scores from the two profile HMMs). tained fingers shorter than 10 amino acids. Summary statistics of the resulting dataset is shown in Table 1. Because of relatively high time complexity of the PPP 2.3 Algorithm complexity algorithm, alignment of genes with high number of fin- The time complexity of the PPP algorithm on two gers takes a lot of time. For that reason, we prepared motif arrays with O(n) motifs, each of length O(m) a subset of the complete dataset, omitting genes from is O(n2 m6 ). There are O(n2 ) individual motif align- human chromosome 19 and their putative orthologs in ments. The time needed to compute one such align- other genomes. These genes contain the highest num- ment is O(Lx Ly L4 ), where Lx , Ly are the lengths of bers of repeating motifs (30 or more). Summary statis- motifs and L is the number of columns in the pro- tics for this restricted dataset are shown in the Table 2. file HMM. This follows from the observation that in the recurrent step of individual motif alignment we Model parameters. The emission probabilities of the fill 3 × L × L × Lx × Ly matrix, and time required to pair HMM used in our experiments were based on the compute each cell is at most O(L2 ), the upper bound BLOSUM85 substitution matrix. This particular ma- on the number of values considered in the recurrence. trix was chosen in order to compare our results to Mo- Typically, the number of columns in the profile HMM tifAligner [11]. In particular, we used the probability 46 Peter Kováč et al. Histogram of ppp.related[, 2] − (ppp.related[, 3] + ppp.related[, 4]) Histogram of ppp.unrelated[, 2] − (ppp.unrelated[, 3] + ppp.unrelated[, 4]) Number of Finger Motifs 80 Genome Genes Variants Total Average Median 80 hg19 323 510 5249 10.29 9 60 60 mm9 201 314 2818 8.97 8 Frequency Frequency canFam2 257 406 3710 9.14 8 40 40 rheMac2 305 484 4766 9.85 9 20 20 Table 2. The restricted dataset, omitting genes from hu- 0 0 man chromosome 19 and their orthologs. −40 −20 0 20 40 60 80 −40 −20 0 20 40 60 80 Score Score Fig. 7. Score distributions in related (left) and random distributions p and q from which the matrix was de- (right) datasets om PPP model. Histogram of nowick.related[, 1] Histogram of nowick.unrelated[, 1] rived, as supplied in EMBOSS software package [13]. 30 The transition probability parameters (see Figure 4) 80 25 were set as follows: τ = 0.0345, so that the expected 20 length of an alignment is 28, which is the length of 60 Frequency Frequency a typical human C2 H2 zinc finger motif; δ = 0.05185 15 40 so that the expected length of a match region is 13.45, 10 because the most variable region of a zinc finger motif 20 5 spans positions 12-15; ε = 0.4769 so that the expected 0 0 length of a gap is 1.1. −50 0 50 Score 100 150 200 −50 0 50 Score 100 150 200 The complete parameter set of the profile model was acquired from the Pfam database entry for the Fig. 8. The score distributions in related (left) and ran- ZNF C2 H2 family [1]. The length of the profile is 23, dom (right) datasets, MotifAligner approach based on the which is shorter than a typical human zinc finger mo- BLOSUM85 matrix. tif. The reason is that the model is based on a more diverse set of sequences from various species. where the related set was treated as positives and the random set as negative examples. The classification 3.1 The PPP score distribution performance of the PPP is clearly better, demonstrat- ing that our scheme is more suitable as a score for To compare the scoring function of the PPP model classification of paired motifs from random pairs. with the scores used by MotifAligner, we created two sets of zinc-finger motif pairs. The related set con- 3.2 Alignment accuracy tained 1000 fingers from the human genome, each paired with the corresponding finger from macaque, Next we use the PPP and the simpler MotifAligner mouse or dog. The random set contained 1000 ran- method for scoring pairs of zinc fingers as building dom pairs of fingers; we assume that these fingers are blocks in the whole motif array alignment. The on average more distantly related to each other than Needleman-Wunsch algorithm for the whole motif ar- paired fingers in the first set. ray alignment has three parameters: the gap opening We have computed a PPP alignment of each se- penalty g, the gap extension penalty e, and the sub- quence pair in both samples. The score distributions stitution matrix s that scores individual motif align- are shown in Figure 7. Both distributions resemble the ments. For MotifAligner, we have used the original normal distribution, with mean of the related set close parameters [11], in particular the BLOSUM85 substi- to 20 and mean of the random set at around 5. tution matrix and the gap penalties set to g = 84 and For comparison purposes, we have reimplemented e = 75.6. In the PPP model, the matrix s is deter- MotifAligner algorithm as described in [11]. Figure 7 mined by the equation 3, and we have tested several shows the score distributions of the MotifAligner ap- different settings of the parameters g and e. proach to alignment of individual motifs, based on the We carried out three tests. In the first one, we BLOSUM85 substitution matrix. These score distribu- aligned all zinc finger arrays of orthologous proteins in tions do not resemble the normal distribution. In par- the complete dataset. The second and the third exper- ticular, the distribution for the related set has a heavy iments simulated a loss of fingers during the evolution tail, which is clearly not desirable. – we created two artificial datasets with 1/5 and 1/3 The important property of the scoring scheme is of the total number of fingers removed in each zinc how well it is able to distinguish positive examples finger array in all four genomes, and we aligned the from negative ones. Figure 9 shows a ROC curve, original human dataset with the four reduced sets. Aligning sequences with repetitive motifs 47 the only existing program specifically designed to align sequences with repetitive motifs. There is still a room for improvement of our work. Apart from obvious upgrades, like a more efficient im- True Positive Rate plementation, the underlying model can be enhanced in several ways. For example, an interesting question is whether some other scoring function of individual mo- tif alignments would perform better. Such a function might be based on different properties of the under- PPP lying models, e.g. the full probability of a sequence, MotifAligner instead of the probability of the Viterbi path. False Positive Rate To alleviate problems caused by the computational complexity of the algorithm, various heuristics could Fig. 9. ROC curve for related (positive) and random (neg- be applied, especially methods avoiding exhausting ative) datasets. computations of the whole dynamic programming ma- trix. In order to apply our model to other protein fami- Aligned Misaligned lies with repeating motifs, a more robust procedure for Dataset Program arrays motifs parameter estimation should be established. In addi- Complete, Unchanged MotifAligner 2161 234 tion, a method for assessment of statistical significance Complete, Unchanged PPP1 2161 178 of alignments may be helpful when computing align- Complete, Unchanged PPP2 2161 331 ments of large datasets where random similarities are Restricted, 1/5 Loss MotifAligner 1609 149 more likely to occur. Restricted, 1/5 Loss PPP1 1609 139 The model we have implemented is not the only Restricted, 1/5 Loss PPP2 1609 142 way of doing sequence alignment with repetitive mo- Restricted, 1/3 Loss MotifAligner 1651 169 tifs. It is very appealing to use a monolithic proba- Restricted, 1/3 Loss PPP1 1651 254 bilistic model instead of multiplying probabilities of Restricted, 1/3 Loss PPP2 1651 252 three separate models. We have tried to develop such Table 3. The comparison of MotifAligner and PPP model. a model, but we were not able to overcome some of PPP1 refers to Needleman-Wunsch gap penalty parame- their intrinsic difficulties. ters set to g = 30, e = 20 and PPP2 to g = 20, e = 10. From the practical point of view, the most serious The third column lists the number of different zinc fin- problem we have encountered is the lack of reliable ger array pairs aligned; fourth column lists the number of benchmark for assessing the accuracy of alignments wrongly aligned motifs. with repetitive motifs. A high quality reference is very valuable, because it allows exact evaluation of algo- rithms and can give a clue where are the weak and the strong parts of a particular method, or how to In our tests, we achieved the best results when the set the method parameters to ensure optimal perfor- gap opening penalty g was set to 30 and the gap ex- mance. We hope that our work will at least partially tension penalty e to 20. The results of all tests are serve as a catalyst towards the creation of such a re- shown in the Table 3. PPP1 was able to out- source. perform the MotifAligner on the Unchanged and 1/5 Loss datasets. On the other hand, our model per- formed slightly worse as the number of lost fingers was References increased. 1. A. Bateman, S. Boehm, E. L. L. Sonnham- mer, F. Gago: Mulitple sequence align- ment of zinc finger C2H2 type family. Pfam 4 Conclusion Family: zf-C2H2 (PF00096), 2011. Online. http://pfam.sanger.ac.uk/family/PF00096. 2. E. J. Bellefroid, D. A. Poncelet, P. J. Lecocq, O. Reve- We have designed and implemented an algorithm for lant, J. A. Martial: The evolutionarily conserved alignment of sequences with repetitive motifs. The al- Krppel-associated box domain defines a subfamily of gorithm is built on top of two types of hidden Markov eukaryotic multifingered proteins. Proc. Natl. Acad. models. It utilizes positional information from two Sci. U.S.A. 88(9), 1991, 3608–3612. copies of a profile HMM and uses a pair HMM to align 3. G. Ding, P. Lorenz, M. Kreutzer, Y. Li, H. J. Thiesen: the motif sequences. We were able to apply our model SysZNF: the C2H2 zinc finger gene database. Nucleic on real world data, and obtained better results than Acids Res. 37 (Database issue), 2009, D267–273. 48 Peter Kováč et al. 4. R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis. 1st edition. Cambridge University Press. 1998, 356 p., ISBN: 978-0521629713. 5. S. R. Eddy: Accelerated profile HMM searches. PLoS Comput. Biol. 7(10), 2011, e1002195. 6. I. Elias: Settling the intractability of multiple align- ment. Journal of Computational Biology 13(7), 2006, 1323–1339. 7. A. T. Hamilton, S. Huntley, M. Tran-Gyamfi, D. M. Baggott, L. Gordon, L. Stubbs: Evolutionary expansion and divergence in the ZNF91 subfamily of primate-specific zinc finger genes. Genome Res. 16(5), 2006, 584–594. 8. S. Henikoff, J. G. Henikoff: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A., 89(22), 1992, 10915–10919. 9. S. Huntley, D. M. Baggott, A. T. Hamilton, M. Tran- Gyamfi, S. Yang, J. Kim, L. Gordon, E. Branscomb, L. Stubbs: A comprehensive catalog of human KRAB- associated zinc finger genes: insights into the evolu- tionary history of a large family of transcriptional re- pressors. Genome Res., 16(5), 2006, 669–677. 10. S. B. Needleman, C. D. Wunsch: A general method ap- plicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48(3), 1970, 443–453. 11. K. Nowick, C. Fields, T. Gernat, D. Caetano-Anolles, N. Kholina, L. Stubbs: Gain, loss and divergence in primate zinc-finger genes: a rich resource for evolution of gene regulatory differences between species. PLoS ONE 6(6), 2011, e21553. 12. K. Nowick, A. T. Hamilton, H. Zhang, L. Stubbs: Rapid sequence and expression divergence suggest se- lection for novel function in primate-specific KRAB- ZNF genes. Molecular Biology and Evolution 27(11), 2010, 2606–2617. 13. P. Rice, I. Longden, A. Bleasby: EMBOSS: the Eu- ropean molecular biology open software suite. Trends Genet., 16(6), 2000, 276–277. 14. D. Schmidt, R. Durrett: Adaptive evolution drives the diversification of zinc-finger binding domains. Mol. Biol. Evol. 21(12), 2004, 2326–2339. 15. B. Schuster-Bockler, J. Schultz, S. Rahmann: HMM Logos for visualization of protein families. BMC Bioin- formatics 5, 2004, 7. 16. J. H. Thomas, R. O. Emerson: Evolution of C2H2-zinc finger genes revisited. BMC Evol. Biol. 9, 2009, 51. 17. UCSC Human Genome Feb. 2009 (hg19, GRCh37) Pairwise Alignments. Online. http://hgdownload.cse.ucsc.edu/downloads.html. 18. R. Urrutia: KRAB-containing zinc-finger repressor proteins. Genome Biology 4(10), 2003, 231. 19. A. J. Viterbi: Error bounds for convolutional codes and an asymtotically optimum decoding algorithm. IEEE Transactions on Information Theory IT-13, 1967, 260– 267.