=Paper= {{Paper |id=None |storemode=property |title=Aligning sequences with repetitive motifs |pdfUrl=https://ceur-ws.org/Vol-990/paper7.pdf |volume=Vol-990 |dblpUrl=https://dblp.org/rec/conf/itat/KovacBV12 }} ==Aligning sequences with repetitive motifs== https://ceur-ws.org/Vol-990/paper7.pdf
                      Aligning sequences with repetitive motifs?

                                  Peter Kováč, Broňa Brejová, and Tomáš Vinař

                Faculty of Mathematics, Physics, and Informatics, Comenius University in Bratislava,
                                   Mlynská dolina, 842 48 Bratislava, Slovakia
                         kovac.peter@fotopriestor.sk, {vinar,brejova}@fmph.uniba.sk

Abstract. Pairwise sequence alignment is among the
most intensively studied problems in computational biology.                              F

We present a method for alignment of two sequences con-                                                       L

taining repetitive motifs. This is motivated by biological
studies of proteins with zinc finger domain, an important                                    C        H
group of regulatory proteins. Due to their evolutionary his-                                     Zn
tory, sequences of these proteins contain a variable number
                                                                                             C        H
of different zinc fingers (short subsequences with specific                H                              T
symbols at each position).                                                     T G E K   P YF                 G E K
                                                                                                                    P
Our algorithm uses two types of hidden Markov models
(HMM): pair HMMs and profile HMMs. Profile HMMs                Fig. 1. The structure of a zinc-finger. Highly variable sites
describe the structure of sequence motifs. Pair HMMs as-       are marked with black color. The most conserved amino
sign a probability to alignment of two motifs. Combination     acids are the four involved in binding the zinc ion [14].
of the these two types of models yields an algorithm that
uses different score when aligning conserved vs. variable
motif residues. The dynamic programming algorithm that
computes the motif alignments is based on the well known       positions are very conserved due to their importance
Viterbi algorithm. We evaluated our model on sequences of      in assuming desired function, while other positions are
zinc finger proteins and compared it with existing alterna-    highly variable, since they distinguish specific DNA se-
tives.                                                         quences where individual zinc fingers bind (Figure 1).
                                                                   We will focus our attention on the KRAB-ZNF
1     Introduction                                             proteins that have a region encoding one or more
                                                               Krüppel-associated box domains (KRAB, [2]) followed
Pairwise sequence alignment is one of the most stud-           by a zinc finger region (Fig. 2). The human genome en-
ied problems in bioinformatics. We will concentrate on         codes more that 600 of proteins from this family, and
alignment of protein sequences, where a protein can            a lot of effort is dedicated to building and maintaining
be represented as a string over the alphabet of 20 dif-        their catalogues [9], [11], [3]. Complicated repetitive
ferent amino acids. During the evolution, particular           structure of these genes is a result of a dynamic evo-
amino acids in a protein can be substituted by an-             lutionary history, full of sequence duplications [7, 12],
other amino acid, or even get inserted or deleted. The         and many mutations which help to gain new functions
goal of sequence alignment is to compare two proteins,         for duplicated copies.
quantify their sequence similarity, and to identify pairs          The repetitive nature of zinc finger protein
of amino acids that have likely evolved from the same          sequences complicates their sequence alignment. Tra-
amino acid in the common ancestor. Over the years,             ditional alignment methods based purely on sequence
multitude of variations of this problem have been in-          similarity frequently misalign individual zinc fingers,
troduced and many practical software tools were de-            or even align a single zinc finger in one sequence to
veloped.                                                       parts of several different zinc fingers in the other se-
    Our work is motivated by the study of zinc finger          quence. Consequently, many studies of these proteins
proteins. These proteins contain a variable number of          limit their analyses and infer conclusions based only
up to 40 zinc finger domains [18]. Zinc finger domain is       on the the KRAB domains or sequences before the zinc
a stretch of approximately 28 amino acids, the purpose         finger region (e.g. [14], [7]), or dispute the relevance of
of which is to bind DNA at specific places. Comparison         standard methods applied to genes with high variance
of zinc fingers form different proteins reveals that some      in the number of fingers [16].
?
    This work was supported by the European Com-                   In this work, we develop a new method for aligning
    munity FP7 Marie Curie grants IRG-224885 to TV             sequences with repetitive motifs, such as zinc finger
    and IRG-231025 to BB, and by a grant from VEGA             proteins. To overcome the problems outlined above, we
    1/1085/12.                                                 combine the strength of profile hidden Markov models
42        Peter Kováč et al.


      KRAB A
                                                       align them by inserting dashes to individual sequences
                                                       so that they all have the same length and when we ar-
     KRAB A KRAB B                                     range them in a table, as in Figure 3, many columns
                                                       contain the same or similar amino acids. Several con-
Fig. 2. Domain structure of typical KRAB-ZNF genes. secutive dashes form a gap in the alignment, indicat-
The protein contains one or more KRAB domains and an ing that a part of the sequence was deleted or inserted
array of 3 to cca. 40 zinc fingers. [18]               during the evolution. The sequence alignment prob-
                                                       lem can be formulated as an optimization problem and
                                                       solved by existing algorithms. For two sequences, the
which are used to characterize the properties of these problem can be solved easily by Needleman-Wunsch
repetitive motifs, and pair hidden Markov models as dynamic programming algorithm [10], for multiple se-
a model of sequence alignments. We compare our work quences it is NP-hard [6]. The scoring function for
to MotifAligner that was previously used to align zinc pairwise alignment is typically based on a substitution
finger proteins [11], and we find our new method to matrix scoring all pairs of aligned amino acids and on
produce more accurate alignments on our testing set. parameters for scoring gaps: gap opening penalty g for
    In the rest of this section, we introduce nec- the first dash in a gap and gap extension penalty e for
essary background and notation and describe the each additional gap.
MotifAligner approach to repetitive sequence align-
ment in more detail. In Section 2, we describe our
new profile-profile-pair alignment method (PPP). We ZNF626_4799/12 YKC--EECGKAF-NQSSILTTHERIILERN-
present the results of experimental comparison of PPP ZNF727_4861/2 YKC--EECGKDC--RLSDFTIQKRIHTADRS
and MotifAligner in Section 3.                         ZXDB_644/5        YQCAFSGCKKTF-ITVSALFSHNRAHFREQE
                                                                  LLNL1236_4814/2 SMC--PECSKTSATDSSCLLMHQRSHTGKRP
                                                                  ZNF23_141/15    FQC--KECGKAF-HVNAHLIRHQRSHTGEKP
1.1     Background and notation
In this paper, we rely on several standard tools from             Fig. 3. Alignment of five sequences of zinc finger motifs
computational biology, namely alignments, pair hid-               from human proteins.
den Markov models, and profile hidden Markov mod-
els, which we briefly explain in this section.
    We start by defining hidden Markov models                         One way of systematically deriving a scoring func-
(HMMs). An HMM is a probabilistic finite state au-                tion for pairwise alignments is to use pair HMMs [4].
tomaton. We can use it to generate a random sequence              These models emit two sequences simultaneously. In
over some alphabet as follows. We start in a desig-               one step, the HMM can emit a single character in one
nated start state B. In each step, we sample a charac-            of the sequences or in both. The later case corresponds
ter of the sequence from the emission probability dis-            to two symbols aligned to each other, the former to
tribution associated with the current state and then              a symbol aligned to a dash. Figure 4 shows the pair
randomly change the state according to the transition             HMM used in our work. The match state M emits
probability distribution. The process ends when we                pairs of aligned characters, state X emits characters
reach the designated final state E.                               only in the first sequence, and state Y emits charac-
    The sequence of states visited in the individual              ters only in the second sequence. Given two sequences,
steps is called a state path. We will denote the proba-           we can find the most probable state path that could
bility of emitting x in state v as ev (x) and the proba-          generate them and this will give us an alignment of
bility of transition from state v to w as tv,w . The joint        these two sequences.
probability of emitting a sequence x = x1 . . . xn along              To represent a typical sequence of a motif, we will
the state path s = s1 . . . sn in a given HMM is                  use another kind of HMMs, called profile HMMs [4].
                                                                  A profile HMM is typically constructed based on an
                                      n
                                      Y                           alignment of several motif instances, such as the one
               P (x, s) = es1 (x1 )         tsi−1 si esi (xi ).
                                                                  in Figure 3. Each position of the motif is represented
                                      i=2
                                                                  by one state with emission probabilities set to the ob-
A typical task solved with HMMs is to find the most               served frequencies of amino acids in the corresponding
probable state path that could generate a given se-               alignment column (possibly with some pseudocounts
quence, i.e. to find s∗ = arg maxs P (x, s). This task            added to avoid zero probabilities). These so called
is solved by the Viterbi algorithm based on dynamic               match states are arranged in a chain (see Figure 5).
programming [19].                                                 These states used alone would generate sequences of
    The second important notion is sequence align-                the same length. However, real sequences may have
ment. Given a set of related protein sequences, we can            various insertions and deletions compared to the con-
                                                                        Aligning sequences with repetitive motifs     43

                                             ε                  HMMER in the original input sequences x and y, re-
                               δ                                spectively. In the second step, MotifAligner computes
                                          X
                                          q          τ
                                                                scores of all gapless pairwise alignments of motifs
                 1-2δ-τ                     xi
                                                                tk , u` , for all 1 ≤ k ≤ a, 1 ≤ ` ≤ b:
                                   δ   1-ε-τ
               1-2δ-τ                                                                          L
                        M                        τ       E
                                                                                               X
        B               p
                        xiyj
                                                                               s[tk , u` ] =         S[tki , u`i ],   (1)
                                                                                               i=1
                                   δ   1-ε-τ τ
                                                             where S[xi , yj ] is the score of aligning amino acids xi
                                          Y
                                          q yj               and yj (they use a standard BLOSUM85 substitution
                          δ
                                                             matrix [8]; motif occurrences are padded to have the
                                      ε                      same length).
                                   τ                            In this way we obtain a similarity score between
                                                             each pair of motif occurrences. Next MotifAligner ap-
Fig. 4. A pair HMM for global alignment. Transition prob- plies the Needleman-Wunsch algorithm [10] to T
abilities are defined by three parameters δ, ε, τ , emission
                                                             and U , treating motifs as sequence symbols and using
probabilities by matrices p and q.
                                                             matrix s as the substitution matrix. In this way we
                                                             obtain pairs of aligned zinc fingers between the two
                                                             proteins.


                                                                2   Profile-profile-pair alignment

                                                                In this section, we present a new approach to align-
                                                                ment of sequences with repetitive motifs. We adopt an
                                                                approach similar to MotifAligner, however, we change
Fig. 5. Example of a profile HMM. States Mk are match           the alignment algorithm and the scoring scheme to
states, Ik are insert states and Dk are delete states. States   take into account the structure of the repeated motif.
B, E, and D1 . . . D3 are silent, which means that they do          For example, the zinc-finger motif (Fig. 2) contains
not generate any characters.                                    several highly conserved positions, among them the
                                                                four amino-acids binding the zinc ion (positions 3, 7,
                                                                20, 24). These four amino acids are crucial to the func-
sensus motif; these are modeled by additional insert            tion of the motif and as such should be used to anchor
and delete states. Given a profile HMM and a se-                the whole alignment. However, the fact that these po-
quence, we can again find the most probable state               sitions match in the two aligned sequences should not
path, which in this case gives us an alignment of the           be very surprising and should not by itself contribute
sequence to the motif represented by the profile HMM.           much to the resulting score. On the other hand, there
Note, however, that the profile HMM emits only a sin-           are several variable positions, and the differences at
gle sequence; the motif itself is represented directly in       these positions will be very informative of the evolu-
the structure and parameters of the model.                      tionary distance.
                                                                    To take these issues into account, we have devel-
1.2   MotifAligner approach                                     oped a new profile-profile-pair method (PPP) for

To obtain high quality alignments even on sequences
with highly variable number of zinc finger motifs, Now-
ick et al. developed a pairwise alignment tool called
MotifAligner [11]. To our knowledge, it is the only se-
quence alignment method designed specifically to align
sequences with variable number of repetitive motifs.
Part of our work was inspired by this algorithm.
    MotifAligner first uses a profile HMM tool
HMMER [5] and finds all canonical motif occurrences
with statistically significant scores in both input se-
quences. Let T = (t1 , . . . , ta ) and U = (u1 , . . . , ub ) Fig. 6. The profile HMM of random 2000 human zinc fin-
be the sequences of all motif occurrences found by gers from the complete dataset, viewed as a HMM logo [15].
44      Peter Kováč et al.

alignment of individual motifs. The method uses                    Thus our goal is to compute three paths s∗p , s∗x , s∗y
a combination of two profile HMMs and a pair HMM               through the pair HMM and the two profile HMMs
for sequence alignment and aligns the two sequences            that would satisfy our constraints and the product of
by finding the best possible path through all three            joint probabilities implied by all three models would
models simultaneously. To align the complete protein           be maximized:
sequences containing these repeating motifs, we first
align each possible pair of motifs through PPP, com-
                                                                  (s∗p , s∗x , s∗y ) = arg max score(x, y, sp , sx , sy ),    (2)
pute their similarity score, and use a modification of                                    valid
                                                                                        sp ,sx ,sy
a traditional global alignment algorithm, now oper-
ating on individual motif occurrences as a unit. We            where score(x, y, sp , sx , sy ) = Ppair (x, y, sp )·
describe the details of the method in the remainder of         Pprofile (x, sx ) · Pprofile (y, sy ), where Ppair (x, y, s) is the
this section.                                                  joint probability of the state path s aligning
                                                               sequences x and y in the pair HMM and Pprofile (x, s) is
2.1   Pairwise alignment of individual motifs                  the joint probability of the state path s and sequence x
                                                               in the profile HMM.
The input to PPP consists of two instances of the re-               We obtain an optimal solution using dynamic pro-
peating motif x = x1 . . . xLx and y = y1 . . . yLy , a pro-   gramming similar to the Viterbi algorithm used to
file HMM encoding the same motif, and a pair HMM               compute the most probable state paths in individual
characterizing the properties of a typical alignment.          HMMs. Let S = (Sp , Sx , Sy ) be the triplet of states
Our goal is to align both x and y to a separate copy of        of pair and x-profile and y-profile models satisfying
profile HMM and at the same time, use the pair HMM             our conditions. We denote V [Sp , Sx , Sy , i, j] the score
as a glue.                                                     of the highest scoring state path combination ending
     In particular, we are simultaneously seeking the          with the triplet S and covering the prefixes x1 . . . xi ,
three paths through the three HMMs that satisfy the            y1 . . . yj of the two sequences.
following constraints:                                              The computation of V [Sp , Sx , Sy , i, j] depends on
                                                               the types of states Sp , Sx , Sy . For example, if Sp is the
Constraint 1 (Profile match states constraint)                 match state M of the pair HMM and Sx is the match
If xi and yj are emitted by the same match state Mk            state Mk of the profile HMM, then according to our
in their profile models then the pair model has emit xi        constraints Sy must be the same match state Mk and
and yj together in the match state M .                         we have the following recurrence:
Constraint 2 (Pair match state constraint) If
the pair model emits xi and yj together in the match           V [M, Mk , Mk , i, j] = eM (xi , yj )eMk (xi )eMk (yj )·
state M then both profile models emit xi and yj in the
                                                                   
                                                                   tM M tM` Mk tM` Mk · V [M, M` , M` , i − 1, j − 1]
                                                                   
same match state Mk or in the same insert state Ik .               
                                                                   
                                                                         for 0 ≤ ` < k
                                                                   
                                                                    t     tIk−1 Mk tIk−1 Mk ·V [M, Ik−1 , Ik−1 ,i−1, j −1]
                                                                   
                                                                      M M
                                                                   
     In other words, if the pair model is in the state X
                                                                   
                                                                   
                                                                    t    t M` Mk tMn Mk · V [X, M` , Mn , i − 1, j − 1]
                                                                   
                                                                      XM
                                                                   
or Y (which is interpreted as a gap in one of the se-
                                                                   
                                                                   
                                                                          for 0 ≤ `, n < k, n 6= `
                                                                   
                                                                   
quences), the two profile models should not be in the
                                                                   
                                                                   
                                                                   tXM tM` Mk tIk−1 Mk · V [X, M` , Ik−1 , i−1, j −1]
                                                                   
                                                                   
same match state: symbols belonging to the same con-               
                                                                          for 0 ≤ ` < k
                                                                   
                                                                   
sensus column should be aligned. However, if both
                                                                   
                                                                   
                                                                   tXM tIk−1 Mk tM` Mk · V [X, Ik−1 , M` , i−1, j −1]
                                                                   
                                                                   
profile models are in the same insert state they can
                                                               max        for 0 ≤ ` < k
either be evolutionarily related, in which case they
                                                                   tXM tIk−1 Mk tIk−1 Mk ·V [X, Ik−1 ,Ik−1 , i−1, j −1]
                                                                   
                                                                   
should be aligned using M state of the pair model,                 
                                                                   tY M tM` Mk tMn Mk · V [Y, M` , Mn , i − 1, j − 1]
                                                                   
                                                                   
or they could have been inserted in the sequence inde-             
                                                                          for 0 ≤ `, n < k, n 6= `
                                                                   
                                                                   
pendently, which would correspond to using X and Y
                                                                   
                                                                   
                                                                   tY M tM` Mk tIk−1 Mk · V [Y, M` , Ik−1 , i − 1, j − 1]
                                                                   
                                                                   
states of the pair model. Constraint 2 also implies, that          
                                                                          for 0 ≤ ` < k
                                                                   
                                                                   
if the profile models are neither in the same match
                                                                   
                                                                   
                                                                    tY M tIk−1 Mk tM` Mk · V [Y, Ik−1 , M` , i−1, j −1]
                                                                   
                                                                   
state Mk nor in the same insert state Ik (i.e. either
                                                                   
                                                                   
                                                                          for 0 ≤ ` < k
                                                                   
                                                                   
are in completely different columns or in the same col-
                                                                   
                                                                   
                                                                    tY M tIk−1 Mk tIk−1 Mk ·V [Y, Ik−1 ,Ik−1 , i−1, j −1]
                                                                   
umn k, but different states Mk and Ik ), which means
that the symbols being emitted are unrelated, then             The value at V [M, Mk , Mk , i, j] has to include the
the pair model should not be in the match state. These         emission probabilities of xi and yj in all three models.
constraints thus ensure that the sequence and the pro-         Then we take a maximum over all choices of previous
file alignment can be interpreted in a consistent man-         cells from which the current cell value could be com-
ner.                                                           puted. Every value considered in the maximum is the
                                                                              Aligning sequences with repetitive motifs        45

product of the value of the predecessor cell and tran-                            Number of                Finger Motifs
sition probabilities in all three models. All the other               Genome    Genes Variants     Total     Average Median
cases can be derived analogously; we omit the deriva-                 hg19        612    1071      13363       12.48     12
tions due to the space constraints.                                   mm9         302      513      5226       10.19     10
                                                                      canFam2     477      828      9259       11.18     11
     Every time we compute a value for any cell, we keep
                                                                      rheMac2     578    1010      12143       12.02     12
a pointer to the cell from which the value was derived.
We use those pointers later to trace back the resulting               Table 1. The complete dataset, based on genes from the
state paths. Of particular importance is the path in the              whole human genome. One gene can have multiple variants
pair model, since it defines the alignment of x and y.                that differ in organization of zinc fingers.
For each of the resulting state path, we also compute
its joint probability its respective model, obtaining val-
ues Ppair (x, y, s∗p ), Pprofile (x, s∗x ), and Pprofile (y, s∗y ).
                                                                          and the length of motifs is almost the same, so we
                                                                          can say that Lx , Ly , L = O(m) and hence the time re-
2.2 Alignment of complete motif arrays                                    quired to compute the alignment of one motif pair is
                                                                               6
We use the same procedure as MotifAligner for align- O(m ). From the same observation, one can easily see
                                                                                                             2     4
ment of complete motif arrays. We compute all pair- that the space complexity is O(n + m ). The running
wise alignments of individual motifs, where the score time and memory is practical, since values of n and m
of a pairwise motif alignment is based on joint proba- tend to be small in real proteins (for zinc-finger arrays,
bilities of motif sequences and state paths in all three both n and m are less than 30).
models as described below. Since we perform the mo-
tif alignment for all pairs of motifs and assign a score
to each such alignment, we get a scoring system simi- 3 Experiments and evaluation
lar to a scoring matrix. Treating motifs as symbols and
using this scoring matrix, we obtain the full alignment Gold standard data set. We evaluated our approach
of input motif arrays using Needleman-Wunsch algo- on human zinc-finger genes and their counterparts in
rithm.                                                                    related species macaque, mouse, and dog. We down-
    More formally, for motif arrays Ax = (x1 , . . . , xn ) loaded the set of annotations of KRAB zinc finger
and Ay = (y1 , . . . ym ), we calculate n × m matrix S, genes from the Human KZNF Catalog [9] and
where                                                                     remapped the annotation to the current hu-
                             Ppair (xi , yj , s∗p,i,j )                   man genome assembly hg19 using liftOver tool. To
  S(xi , yj ) = ln                                                  , (3) obtain the sequences of these zinc finger genes in other
                   Pprofile (xi , s∗x,i,j )Pprofile (yj , s∗y,i,j )       species, we used the whole genome alignments from
where sp,i,j , sx,i,j and sy,i,j are the three state paths the UCSC genome browser [17] as a mapping between
          ∗      ∗            ∗

computed when aligning motifs xi , yj by PPP. This the human (hg19) and the macaque (rheMac2),
score compares the hypothesis that the two motif se- mouse (mm9), and dog (canFam2) genomes.
quences are related (given by probability from the pair                       The resulting genomic sequences were translated
HMM) to the hypothesis that these are simply two                          into  amino acid sequences and cleaned for apparent
independent sequences following the same profile (as                      artifacts. In particular, we removed genes that con-
determined by scores from the two profile HMMs).                          tained  fingers shorter than 10 amino acids. Summary
                                                                          statistics of the resulting dataset is shown in Table 1.
                                                                          Because of relatively high time complexity of the PPP
2.3 Algorithm complexity                                                  algorithm, alignment of genes with high number of fin-
The time complexity of the PPP algorithm on two gers takes a lot of time. For that reason, we prepared
motif arrays with O(n) motifs, each of length O(m) a subset of the complete dataset, omitting genes from
is O(n2 m6 ). There are O(n2 ) individual motif align- human chromosome 19 and their putative orthologs in
ments. The time needed to compute one such align- other genomes. These genes contain the highest num-
ment is O(Lx Ly L4 ), where Lx , Ly are the lengths of bers of repeating motifs (30 or more). Summary statis-
motifs and L is the number of columns in the pro- tics for this restricted dataset are shown in the Table 2.
file HMM. This follows from the observation that in
the recurrent step of individual motif alignment we                   Model parameters. The emission probabilities of the
fill 3 × L × L × Lx × Ly matrix, and time required to                 pair HMM used in our experiments were based on the
compute each cell is at most O(L2 ), the upper bound                  BLOSUM85 substitution matrix. This particular ma-
on the number of values considered in the recurrence.                 trix was chosen in order to compare our results to Mo-
Typically, the number of columns in the profile HMM                   tifAligner [11]. In particular, we used the probability
46     Peter Kováč et al.
                                                                            Histogram of ppp.related[, 2] − (ppp.related[, 3] + ppp.related[, 4])               Histogram of ppp.unrelated[, 2] − (ppp.unrelated[, 3] + ppp.unrelated[, 4])



            Number of                 Finger Motifs




                                                                       80
Genome    Genes Variants      Total    Average Median




                                                                                                                                                                 80
hg19        323      510      5249        10.29     9




                                                                       60




                                                                                                                                                                 60
mm9         201      314      2818         8.97     8




                                                           Frequency




                                                                                                                                                    Frequency
canFam2     257      406      3710         9.14     8




                                                                       40




                                                                                                                                                                 40
rheMac2     305      484      4766         9.85     9




                                                                       20




                                                                                                                                                                 20
Table 2. The restricted dataset, omitting genes from hu-




                                                                       0




                                                                                                                                                                 0
man chromosome 19 and their orthologs.                                        −40      −20          0      20       40       60         80                            −40      −20           0    20       40       60         80

                                                                                                            Score                                                                                  Score




                                                           Fig. 7. Score distributions in related (left) and random
distributions p and q from which the matrix was de-        (right) datasets om PPP model.
                                                                                             Histogram of nowick.related[, 1]                                                        Histogram of nowick.unrelated[, 1]

rived, as supplied in EMBOSS software package [13].




                                                                                                                                                                 30
The transition probability parameters (see Figure 4)




                                                                       80




                                                                                                                                                                 25
were set as follows: τ = 0.0345, so that the expected




                                                                                                                                                                 20
length of an alignment is 28, which is the length of


                                                                       60
                                                           Frequency




                                                                                                                                                    Frequency
a typical human C2 H2 zinc finger motif; δ = 0.05185




                                                                                                                                                                 15
                                                                       40
so that the expected length of a match region is 13.45,




                                                                                                                                                                 10
because the most variable region of a zinc finger motif                20




                                                                                                                                                                 5
spans positions 12-15; ε = 0.4769 so that the expected
                                                                       0




                                                                                                                                                                 0
length of a gap is 1.1.                                                             −50         0          50

                                                                                                            Score
                                                                                                                     100          150        200                            −50          0        50

                                                                                                                                                                                                   Score
                                                                                                                                                                                                            100          150        200



     The complete parameter set of the profile model
was acquired from the Pfam database entry for the Fig. 8. The score distributions in related (left) and ran-
ZNF C2 H2 family [1]. The length of the profile is 23, dom (right) datasets, MotifAligner approach based on the
which is shorter than a typical human zinc finger mo- BLOSUM85 matrix.
tif. The reason is that the model is based on a more
diverse set of sequences from various species.
                                                          where the related set was treated as positives and the
                                                          random set as negative examples. The classification
3.1 The PPP score distribution                            performance of the PPP is clearly better, demonstrat-
                                                          ing that our scheme is more suitable as a score for
To compare the scoring function of the PPP model classification of paired motifs from random pairs.
with the scores used by MotifAligner, we created two
sets of zinc-finger motif pairs. The related set con-
                                                          3.2 Alignment accuracy
tained 1000 fingers from the human genome, each
paired with the corresponding finger from macaque, Next we use the PPP and the simpler MotifAligner
mouse or dog. The random set contained 1000 ran- method for scoring pairs of zinc fingers as building
dom pairs of fingers; we assume that these fingers are blocks in the whole motif array alignment. The
on average more distantly related to each other than Needleman-Wunsch algorithm for the whole motif ar-
paired fingers in the first set.                          ray alignment has three parameters: the gap opening
     We have computed a PPP alignment of each se- penalty g, the gap extension penalty e, and the sub-
quence pair in both samples. The score distributions stitution matrix s that scores individual motif align-
are shown in Figure 7. Both distributions resemble the ments. For MotifAligner, we have used the original
normal distribution, with mean of the related set close parameters [11], in particular the BLOSUM85 substi-
to 20 and mean of the random set at around 5.             tution matrix and the gap penalties set to g = 84 and
     For comparison purposes, we have reimplemented e = 75.6. In the PPP model, the matrix s is deter-
MotifAligner algorithm as described in [11]. Figure 7 mined by the equation 3, and we have tested several
shows the score distributions of the MotifAligner ap- different settings of the parameters g and e.
proach to alignment of individual motifs, based on the        We carried out three tests. In the first one, we
BLOSUM85 substitution matrix. These score distribu- aligned all zinc finger arrays of orthologous proteins in
tions do not resemble the normal distribution. In par- the complete dataset. The second and the third exper-
ticular, the distribution for the related set has a heavy iments simulated a loss of fingers during the evolution
tail, which is clearly not desirable.                     – we created two artificial datasets with 1/5 and 1/3
     The important property of the scoring scheme is of the total number of fingers removed in each zinc
how well it is able to distinguish positive examples finger array in all four genomes, and we aligned the
from negative ones. Figure 9 shows a ROC curve, original human dataset with the four reduced sets.
                                                                                 Aligning sequences with repetitive motifs        47

                                                                         the only existing program specifically designed to align
                                                                         sequences with repetitive motifs.
                                                                             There is still a room for improvement of our work.
                                                                         Apart from obvious upgrades, like a more efficient im-
         True Positive Rate




                                                                         plementation, the underlying model can be enhanced
                                                                         in several ways. For example, an interesting question is
                                                                         whether some other scoring function of individual mo-
                                                                         tif alignments would perform better. Such a function
                                                                         might be based on different properties of the under-
                                                    PPP                  lying models, e.g. the full probability of a sequence,
                                            MotifAligner
                                                                         instead of the probability of the Viterbi path.
                              False Positive Rate                            To alleviate problems caused by the computational
                                                                         complexity of the algorithm, various heuristics could
Fig. 9. ROC curve for related (positive) and random (neg-                be applied, especially methods avoiding exhausting
ative) datasets.                                                         computations of the whole dynamic programming ma-
                                                                         trix. In order to apply our model to other protein fami-
                                                    Aligned Misaligned   lies with repeating motifs, a more robust procedure for
 Dataset              Program                        arrays motifs       parameter estimation should be established. In addi-
 Complete, Unchanged MotifAligner                      2161   234        tion, a method for assessment of statistical significance
 Complete, Unchanged PPP1                              2161   178        of alignments may be helpful when computing align-
 Complete, Unchanged PPP2                              2161   331        ments of large datasets where random similarities are
 Restricted, 1/5 Loss MotifAligner                     1609   149        more likely to occur.
 Restricted, 1/5 Loss PPP1                             1609   139            The model we have implemented is not the only
 Restricted, 1/5 Loss PPP2                             1609   142        way of doing sequence alignment with repetitive mo-
 Restricted, 1/3 Loss MotifAligner                     1651   169        tifs. It is very appealing to use a monolithic proba-
 Restricted, 1/3 Loss PPP1                             1651   254        bilistic model instead of multiplying probabilities of
 Restricted, 1/3 Loss PPP2                             1651   252        three separate models. We have tried to develop such
Table 3. The comparison of MotifAligner and PPP model.                   a model, but we were not able to overcome some of
PPP1 refers to Needleman-Wunsch gap penalty parame-                      their intrinsic difficulties.
ters set to g = 30, e = 20 and PPP2 to g = 20, e = 10.                       From the practical point of view, the most serious
The third column lists the number of different zinc fin-                 problem we have encountered is the lack of reliable
ger array pairs aligned; fourth column lists the number of               benchmark for assessing the accuracy of alignments
wrongly aligned motifs.                                                  with repetitive motifs. A high quality reference is very
                                                                         valuable, because it allows exact evaluation of algo-
                                                                         rithms and can give a clue where are the weak and
                                                                         the strong parts of a particular method, or how to
    In our tests, we achieved the best results when the                  set the method parameters to ensure optimal perfor-
gap opening penalty g was set to 30 and the gap ex-                      mance. We hope that our work will at least partially
tension penalty e to 20. The results of all tests are                    serve as a catalyst towards the creation of such a re-
shown in the Table 3. PPP1 was able to out-                              source.
perform the MotifAligner on the Unchanged and
1/5 Loss datasets. On the other hand, our model per-
formed slightly worse as the number of lost fingers was                  References
increased.
                                                                          1. A. Bateman, S. Boehm, E. L. L. Sonnham-
                                                                             mer, F. Gago:              Mulitple sequence align-
                                                                             ment of zinc finger C2H2 type family. Pfam
4    Conclusion                                                              Family: zf-C2H2 (PF00096), 2011.                 Online.
                                                                             http://pfam.sanger.ac.uk/family/PF00096.
                                                                          2. E. J. Bellefroid, D. A. Poncelet, P. J. Lecocq, O. Reve-
We have designed and implemented an algorithm for
                                                                             lant, J. A. Martial:     The evolutionarily conserved
alignment of sequences with repetitive motifs. The al-
                                                                             Krppel-associated box domain defines a subfamily of
gorithm is built on top of two types of hidden Markov                        eukaryotic multifingered proteins. Proc. Natl. Acad.
models. It utilizes positional information from two                          Sci. U.S.A. 88(9), 1991, 3608–3612.
copies of a profile HMM and uses a pair HMM to align                      3. G. Ding, P. Lorenz, M. Kreutzer, Y. Li, H. J. Thiesen:
the motif sequences. We were able to apply our model                         SysZNF: the C2H2 zinc finger gene database. Nucleic
on real world data, and obtained better results than                         Acids Res. 37 (Database issue), 2009, D267–273.
48      Peter Kováč et al.

 4. R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological
    sequence analysis. 1st edition. Cambridge University
    Press. 1998, 356 p., ISBN: 978-0521629713.
 5. S. R. Eddy: Accelerated profile HMM searches. PLoS
    Comput. Biol. 7(10), 2011, e1002195.
 6. I. Elias: Settling the intractability of multiple align-
    ment. Journal of Computational Biology 13(7), 2006,
    1323–1339.
 7. A. T. Hamilton, S. Huntley, M. Tran-Gyamfi,
    D. M. Baggott, L. Gordon, L. Stubbs: Evolutionary
    expansion and divergence in the ZNF91 subfamily of
    primate-specific zinc finger genes. Genome Res. 16(5),
    2006, 584–594.
 8. S. Henikoff, J. G. Henikoff: Amino acid substitution
    matrices from protein blocks. Proc. Natl. Acad. Sci.
    U.S.A., 89(22), 1992, 10915–10919.
 9. S. Huntley, D. M. Baggott, A. T. Hamilton, M. Tran-
    Gyamfi, S. Yang, J. Kim, L. Gordon, E. Branscomb,
    L. Stubbs: A comprehensive catalog of human KRAB-
    associated zinc finger genes: insights into the evolu-
    tionary history of a large family of transcriptional re-
    pressors. Genome Res., 16(5), 2006, 669–677.
10. S. B. Needleman, C. D. Wunsch: A general method ap-
    plicable to the search for similarities in the amino acid
    sequence of two proteins. J. Mol. Biol., 48(3), 1970,
    443–453.
11. K. Nowick, C. Fields, T. Gernat, D. Caetano-Anolles,
    N. Kholina, L. Stubbs: Gain, loss and divergence in
    primate zinc-finger genes: a rich resource for evolution
    of gene regulatory differences between species. PLoS
    ONE 6(6), 2011, e21553.
12. K. Nowick, A. T. Hamilton, H. Zhang, L. Stubbs:
    Rapid sequence and expression divergence suggest se-
    lection for novel function in primate-specific KRAB-
    ZNF genes. Molecular Biology and Evolution 27(11),
    2010, 2606–2617.
13. P. Rice, I. Longden, A. Bleasby: EMBOSS: the Eu-
    ropean molecular biology open software suite. Trends
    Genet., 16(6), 2000, 276–277.
14. D. Schmidt, R. Durrett: Adaptive evolution drives the
    diversification of zinc-finger binding domains. Mol.
    Biol. Evol. 21(12), 2004, 2326–2339.
15. B. Schuster-Bockler, J. Schultz, S. Rahmann: HMM
    Logos for visualization of protein families. BMC Bioin-
    formatics 5, 2004, 7.
16. J. H. Thomas, R. O. Emerson: Evolution of C2H2-zinc
    finger genes revisited. BMC Evol. Biol. 9, 2009, 51.
17. UCSC          Human Genome Feb. 2009 (hg19,
    GRCh37)        Pairwise     Alignments.           Online.
    http://hgdownload.cse.ucsc.edu/downloads.html.
18. R. Urrutia: KRAB-containing zinc-finger repressor
    proteins. Genome Biology 4(10), 2003, 231.
19. A. J. Viterbi: Error bounds for convolutional codes and
    an asymtotically optimum decoding algorithm. IEEE
    Transactions on Information Theory IT-13, 1967, 260–
    267.