=Paper= {{Paper |id=None |storemode=property |title=Aligning sequences with repetitive motifs |pdfUrl=https://ceur-ws.org/Vol-990/paper7.pdf |volume=Vol-990 |dblpUrl=https://dblp.org/rec/conf/itat/KovacBV12 }} ==Aligning sequences with repetitive motifs== https://ceur-ws.org/Vol-990/paper7.pdf

Aligning sequences with repetitive motifs?

Peter Kováč, Broňa Brejová, and Tomáš Vinař

Faculty of Mathematics, Physics, and Informatics, Comenius University in Bratislava,
Mlynská dolina, 842 48 Bratislava, Slovakia
kovac.peter@fotopriestor.sk, {vinar,brejova}@fmph.uniba.sk

Abstract. Pairwise sequence alignment is among the
most intensively studied problems in computational biology. F

We present a method for alignment of two sequences con- L

taining repetitive motifs. This is motivated by biological
studies of proteins with zinc finger domain, an important C H
group of regulatory proteins. Due to their evolutionary his- Zn
tory, sequences of these proteins contain a variable number
C H
of different zinc fingers (short subsequences with specific H T
symbols at each position). T G E K P YF G E K
P
Our algorithm uses two types of hidden Markov models
(HMM): pair HMMs and profile HMMs. Profile HMMs Fig. 1. The structure of a zinc-finger. Highly variable sites
describe the structure of sequence motifs. Pair HMMs as- are marked with black color. The most conserved amino
sign a probability to alignment of two motifs. Combination acids are the four involved in binding the zinc ion [14].
of the these two types of models yields an algorithm that
uses different score when aligning conserved vs. variable
motif residues. The dynamic programming algorithm that
computes the motif alignments is based on the well known positions are very conserved due to their importance
Viterbi algorithm. We evaluated our model on sequences of in assuming desired function, while other positions are
zinc finger proteins and compared it with existing alterna- highly variable, since they distinguish specific DNA se-
tives. quences where individual zinc fingers bind (Figure 1).
We will focus our attention on the KRAB-ZNF
1 Introduction proteins that have a region encoding one or more
Krüppel-associated box domains (KRAB, [2]) followed
Pairwise sequence alignment is one of the most stud- by a zinc finger region (Fig. 2). The human genome en-
ied problems in bioinformatics. We will concentrate on codes more that 600 of proteins from this family, and
alignment of protein sequences, where a protein can a lot of effort is dedicated to building and maintaining
be represented as a string over the alphabet of 20 dif- their catalogues [9], [11], [3]. Complicated repetitive
ferent amino acids. During the evolution, particular structure of these genes is a result of a dynamic evo-
amino acids in a protein can be substituted by an- lutionary history, full of sequence duplications [7, 12],
other amino acid, or even get inserted or deleted. The and many mutations which help to gain new functions
goal of sequence alignment is to compare two proteins, for duplicated copies.
quantify their sequence similarity, and to identify pairs The repetitive nature of zinc finger protein
of amino acids that have likely evolved from the same sequences complicates their sequence alignment. Tra-
amino acid in the common ancestor. Over the years, ditional alignment methods based purely on sequence
multitude of variations of this problem have been in- similarity frequently misalign individual zinc fingers,
troduced and many practical software tools were de- or even align a single zinc finger in one sequence to
veloped. parts of several different zinc fingers in the other se-
Our work is motivated by the study of zinc finger quence. Consequently, many studies of these proteins
proteins. These proteins contain a variable number of limit their analyses and infer conclusions based only
up to 40 zinc finger domains [18]. Zinc finger domain is on the the KRAB domains or sequences before the zinc
a stretch of approximately 28 amino acids, the purpose finger region (e.g. [14], [7]), or dispute the relevance of
of which is to bind DNA at specific places. Comparison standard methods applied to genes with high variance
of zinc fingers form different proteins reveals that some in the number of fingers [16].
?
This work was supported by the European Com- In this work, we develop a new method for aligning
munity FP7 Marie Curie grants IRG-224885 to TV sequences with repetitive motifs, such as zinc finger
and IRG-231025 to BB, and by a grant from VEGA proteins. To overcome the problems outlined above, we
1/1085/12. combine the strength of profile hidden Markov models
42 Peter Kováč et al.

KRAB A
align them by inserting dashes to individual sequences
so that they all have the same length and when we ar-
KRAB A KRAB B range them in a table, as in Figure 3, many columns
contain the same or similar amino acids. Several con-
Fig. 2. Domain structure of typical KRAB-ZNF genes. secutive dashes form a gap in the alignment, indicat-
The protein contains one or more KRAB domains and an ing that a part of the sequence was deleted or inserted
array of 3 to cca. 40 zinc fingers. [18] during the evolution. The sequence alignment prob-
lem can be formulated as an optimization problem and
solved by existing algorithms. For two sequences, the
which are used to characterize the properties of these problem can be solved easily by Needleman-Wunsch
repetitive motifs, and pair hidden Markov models as dynamic programming algorithm [10], for multiple se-
a model of sequence alignments. We compare our work quences it is NP-hard [6]. The scoring function for
to MotifAligner that was previously used to align zinc pairwise alignment is typically based on a substitution
finger proteins [11], and we find our new method to matrix scoring all pairs of aligned amino acids and on
produce more accurate alignments on our testing set. parameters for scoring gaps: gap opening penalty g for
In the rest of this section, we introduce nec- the first dash in a gap and gap extension penalty e for
essary background and notation and describe the each additional gap.
MotifAligner approach to repetitive sequence align-
ment in more detail. In Section 2, we describe our
new profile-profile-pair alignment method (PPP). We ZNF626_4799/12 YKC--EECGKAF-NQSSILTTHERIILERN-
present the results of experimental comparison of PPP ZNF727_4861/2 YKC--EECGKDC--RLSDFTIQKRIHTADRS
and MotifAligner in Section 3. ZXDB_644/5 YQCAFSGCKKTF-ITVSALFSHNRAHFREQE
LLNL1236_4814/2 SMC--PECSKTSATDSSCLLMHQRSHTGKRP
ZNF23_141/15 FQC--KECGKAF-HVNAHLIRHQRSHTGEKP
1.1 Background and notation
In this paper, we rely on several standard tools from Fig. 3. Alignment of five sequences of zinc finger motifs
computational biology, namely alignments, pair hid- from human proteins.
den Markov models, and profile hidden Markov mod-
els, which we briefly explain in this section.
We start by defining hidden Markov models One way of systematically deriving a scoring func-
(HMMs). An HMM is a probabilistic finite state au- tion for pairwise alignments is to use pair HMMs [4].
tomaton. We can use it to generate a random sequence These models emit two sequences simultaneously. In
over some alphabet as follows. We start in a desig- one step, the HMM can emit a single character in one
nated start state B. In each step, we sample a charac- of the sequences or in both. The later case corresponds
ter of the sequence from the emission probability dis- to two symbols aligned to each other, the former to
tribution associated with the current state and then a symbol aligned to a dash. Figure 4 shows the pair
randomly change the state according to the transition HMM used in our work. The match state M emits
probability distribution. The process ends when we pairs of aligned characters, state X emits characters
reach the designated final state E. only in the first sequence, and state Y emits charac-
The sequence of states visited in the individual ters only in the second sequence. Given two sequences,
steps is called a state path. We will denote the proba- we can find the most probable state path that could
bility of emitting x in state v as ev (x) and the proba- generate them and this will give us an alignment of
bility of transition from state v to w as tv,w . The joint these two sequences.
probability of emitting a sequence x = x1 . . . xn along To represent a typical sequence of a motif, we will
the state path s = s1 . . . sn in a given HMM is use another kind of HMMs, called profile HMMs [4].
A profile HMM is typically constructed based on an
n
Y alignment of several motif instances, such as the one
P (x, s) = es1 (x1 ) tsi−1 si esi (xi ).
in Figure 3. Each position of the motif is represented
i=2
by one state with emission probabilities set to the ob-
A typical task solved with HMMs is to find the most served frequencies of amino acids in the corresponding
probable state path that could generate a given se- alignment column (possibly with some pseudocounts
quence, i.e. to find s∗ = arg maxs P (x, s). This task added to avoid zero probabilities). These so called
is solved by the Viterbi algorithm based on dynamic match states are arranged in a chain (see Figure 5).
programming [19]. These states used alone would generate sequences of
The second important notion is sequence align- the same length. However, real sequences may have
ment. Given a set of related protein sequences, we can various insertions and deletions compared to the con-
Aligning sequences with repetitive motifs 43

ε HMMER in the original input sequences x and y, re-
δ spectively. In the second step, MotifAligner computes
X
q τ
scores of all gapless pairwise alignments of motifs
1-2δ-τ xi
tk , u` , for all 1 ≤ k ≤ a, 1 ≤ ` ≤ b:
δ 1-ε-τ
1-2δ-τ L
M τ E
X
B p
xiyj
s[tk , u` ] = S[tki , u`i ], (1)
i=1
δ 1-ε-τ τ
where S[xi , yj ] is the score of aligning amino acids xi
Y
q yj and yj (they use a standard BLOSUM85 substitution
δ
matrix [8]; motif occurrences are padded to have the
ε same length).
τ In this way we obtain a similarity score between
each pair of motif occurrences. Next MotifAligner ap-
Fig. 4. A pair HMM for global alignment. Transition prob- plies the Needleman-Wunsch algorithm [10] to T
abilities are defined by three parameters δ, ε, τ , emission
and U , treating motifs as sequence symbols and using
probabilities by matrices p and q.
matrix s as the substitution matrix. In this way we
obtain pairs of aligned zinc fingers between the two
proteins.

2 Profile-profile-pair alignment

In this section, we present a new approach to align-
ment of sequences with repetitive motifs. We adopt an
approach similar to MotifAligner, however, we change
Fig. 5. Example of a profile HMM. States Mk are match the alignment algorithm and the scoring scheme to
states, Ik are insert states and Dk are delete states. States take into account the structure of the repeated motif.
B, E, and D1 . . . D3 are silent, which means that they do For example, the zinc-finger motif (Fig. 2) contains
not generate any characters. several highly conserved positions, among them the
four amino-acids binding the zinc ion (positions 3, 7,
20, 24). These four amino acids are crucial to the func-
sensus motif; these are modeled by additional insert tion of the motif and as such should be used to anchor
and delete states. Given a profile HMM and a se- the whole alignment. However, the fact that these po-
quence, we can again find the most probable state sitions match in the two aligned sequences should not
path, which in this case gives us an alignment of the be very surprising and should not by itself contribute
sequence to the motif represented by the profile HMM. much to the resulting score. On the other hand, there
Note, however, that the profile HMM emits only a sin- are several variable positions, and the differences at
gle sequence; the motif itself is represented directly in these positions will be very informative of the evolu-
the structure and parameters of the model. tionary distance.
To take these issues into account, we have devel-
1.2 MotifAligner approach oped a new profile-profile-pair method (PPP) for

To obtain high quality alignments even on sequences
with highly variable number of zinc finger motifs, Now-
ick et al. developed a pairwise alignment tool called
MotifAligner [11]. To our knowledge, it is the only se-
quence alignment method designed specifically to align
sequences with variable number of repetitive motifs.
Part of our work was inspired by this algorithm.
MotifAligner first uses a profile HMM tool
HMMER [5] and finds all canonical motif occurrences
with statistically significant scores in both input se-
quences. Let T = (t1 , . . . , ta ) and U = (u1 , . . . , ub ) Fig. 6. The profile HMM of random 2000 human zinc fin-
be the sequences of all motif occurrences found by gers from the complete dataset, viewed as a HMM logo [15].
44 Peter Kováč et al.

alignment of individual motifs. The method uses Thus our goal is to compute three paths s∗p , s∗x , s∗y
a combination of two profile HMMs and a pair HMM through the pair HMM and the two profile HMMs
for sequence alignment and aligns the two sequences that would satisfy our constraints and the product of
by finding the best possible path through all three joint probabilities implied by all three models would
models simultaneously. To align the complete protein be maximized:
sequences containing these repeating motifs, we first
align each possible pair of motifs through PPP, com-
(s∗p , s∗x , s∗y ) = arg max score(x, y, sp , sx , sy ), (2)
pute their similarity score, and use a modification of valid
sp ,sx ,sy
a traditional global alignment algorithm, now oper-
ating on individual motif occurrences as a unit. We where score(x, y, sp , sx , sy ) = Ppair (x, y, sp )·
describe the details of the method in the remainder of Pprofile (x, sx ) · Pprofile (y, sy ), where Ppair (x, y, s) is the
this section. joint probability of the state path s aligning
sequences x and y in the pair HMM and Pprofile (x, s) is
2.1 Pairwise alignment of individual motifs the joint probability of the state path s and sequence x
in the profile HMM.
The input to PPP consists of two instances of the re- We obtain an optimal solution using dynamic pro-
peating motif x = x1 . . . xLx and y = y1 . . . yLy , a pro- gramming similar to the Viterbi algorithm used to
file HMM encoding the same motif, and a pair HMM compute the most probable state paths in individual
characterizing the properties of a typical alignment. HMMs. Let S = (Sp , Sx , Sy ) be the triplet of states
Our goal is to align both x and y to a separate copy of of pair and x-profile and y-profile models satisfying
profile HMM and at the same time, use the pair HMM our conditions. We denote V [Sp , Sx , Sy , i, j] the score
as a glue. of the highest scoring state path combination ending
In particular, we are simultaneously seeking the with the triplet S and covering the prefixes x1 . . . xi ,
three paths through the three HMMs that satisfy the y1 . . . yj of the two sequences.
following constraints: The computation of V [Sp , Sx , Sy , i, j] depends on
the types of states Sp , Sx , Sy . For example, if Sp is the
Constraint 1 (Profile match states constraint) match state M of the pair HMM and Sx is the match
If xi and yj are emitted by the same match state Mk state Mk of the profile HMM, then according to our
in their profile models then the pair model has emit xi constraints Sy must be the same match state Mk and
and yj together in the match state M . we have the following recurrence:
Constraint 2 (Pair match state constraint) If
the pair model emits xi and yj together in the match V [M, Mk , Mk , i, j] = eM (xi , yj )eMk (xi )eMk (yj )·
state M then both profile models emit xi and yj in the

tM M tM` Mk tM` Mk · V [M, M` , M` , i − 1, j − 1]

same match state Mk or in the same insert state Ik . 

 for 0 ≤ ` < k

t tIk−1 Mk tIk−1 Mk ·V [M, Ik−1 , Ik−1 ,i−1, j −1]

M M

In other words, if the pair model is in the state X


t t M` Mk tMn Mk · V [X, M` , Mn , i − 1, j − 1]

XM

or Y (which is interpreted as a gap in one of the se-


for 0 ≤ `, n < k, n 6= `


quences), the two profile models should not be in the


tXM tM` Mk tIk−1 Mk · V [X, M` , Ik−1 , i−1, j −1]


same match state: symbols belonging to the same con- 
for 0 ≤ ` < k


sensus column should be aligned. However, if both


tXM tIk−1 Mk tM` Mk · V [X, Ik−1 , M` , i−1, j −1]


profile models are in the same insert state they can
max for 0 ≤ ` < k
either be evolutionarily related, in which case they
tXM tIk−1 Mk tIk−1 Mk ·V [X, Ik−1 ,Ik−1 , i−1, j −1]


should be aligned using M state of the pair model, 
tY M tM` Mk tMn Mk · V [Y, M` , Mn , i − 1, j − 1]


or they could have been inserted in the sequence inde- 
for 0 ≤ `, n < k, n 6= `


pendently, which would correspond to using X and Y


tY M tM` Mk tIk−1 Mk · V [Y, M` , Ik−1 , i − 1, j − 1]


states of the pair model. Constraint 2 also implies, that 
for 0 ≤ ` < k


if the profile models are neither in the same match


tY M tIk−1 Mk tM` Mk · V [Y, Ik−1 , M` , i−1, j −1]


state Mk nor in the same insert state Ik (i.e. either


for 0 ≤ ` < k


are in completely different columns or in the same col-


tY M tIk−1 Mk tIk−1 Mk ·V [Y, Ik−1 ,Ik−1 , i−1, j −1]

umn k, but different states Mk and Ik ), which means
that the symbols being emitted are unrelated, then The value at V [M, Mk , Mk , i, j] has to include the
the pair model should not be in the match state. These emission probabilities of xi and yj in all three models.
constraints thus ensure that the sequence and the pro- Then we take a maximum over all choices of previous
file alignment can be interpreted in a consistent man- cells from which the current cell value could be com-
ner. puted. Every value considered in the maximum is the
Aligning sequences with repetitive motifs 45

product of the value of the predecessor cell and tran- Number of Finger Motifs
sition probabilities in all three models. All the other Genome Genes Variants Total Average Median
cases can be derived analogously; we omit the deriva- hg19 612 1071 13363 12.48 12
tions due to the space constraints. mm9 302 513 5226 10.19 10
canFam2 477 828 9259 11.18 11
Every time we compute a value for any cell, we keep
rheMac2 578 1010 12143 12.02 12
a pointer to the cell from which the value was derived.
We use those pointers later to trace back the resulting Table 1. The complete dataset, based on genes from the
state paths. Of particular importance is the path in the whole human genome. One gene can have multiple variants
pair model, since it defines the alignment of x and y. that differ in organization of zinc fingers.
For each of the resulting state path, we also compute
its joint probability its respective model, obtaining val-
ues Ppair (x, y, s∗p ), Pprofile (x, s∗x ), and Pprofile (y, s∗y ).
and the length of motifs is almost the same, so we
can say that Lx , Ly , L = O(m) and hence the time re-
2.2 Alignment of complete motif arrays quired to compute the alignment of one motif pair is
6
We use the same procedure as MotifAligner for align- O(m ). From the same observation, one can easily see
2 4
ment of complete motif arrays. We compute all pair- that the space complexity is O(n + m ). The running
wise alignments of individual motifs, where the score time and memory is practical, since values of n and m
of a pairwise motif alignment is based on joint proba- tend to be small in real proteins (for zinc-finger arrays,
bilities of motif sequences and state paths in all three both n and m are less than 30).
models as described below. Since we perform the mo-
tif alignment for all pairs of motifs and assign a score
to each such alignment, we get a scoring system simi- 3 Experiments and evaluation
lar to a scoring matrix. Treating motifs as symbols and
using this scoring matrix, we obtain the full alignment Gold standard data set. We evaluated our approach
of input motif arrays using Needleman-Wunsch algo- on human zinc-finger genes and their counterparts in
rithm. related species macaque, mouse, and dog. We down-
More formally, for motif arrays Ax = (x1 , . . . , xn ) loaded the set of annotations of KRAB zinc finger
and Ay = (y1 , . . . ym ), we calculate n × m matrix S, genes from the Human KZNF Catalog [9] and
where remapped the annotation to the current hu-
Ppair (xi , yj , s∗p,i,j ) man genome assembly hg19 using liftOver tool. To
S(xi , yj ) = ln , (3) obtain the sequences of these zinc finger genes in other
Pprofile (xi , s∗x,i,j )Pprofile (yj , s∗y,i,j ) species, we used the whole genome alignments from
where sp,i,j , sx,i,j and sy,i,j are the three state paths the UCSC genome browser [17] as a mapping between
∗ ∗ ∗

computed when aligning motifs xi , yj by PPP. This the human (hg19) and the macaque (rheMac2),
score compares the hypothesis that the two motif se- mouse (mm9), and dog (canFam2) genomes.
quences are related (given by probability from the pair The resulting genomic sequences were translated
HMM) to the hypothesis that these are simply two into amino acid sequences and cleaned for apparent
independent sequences following the same profile (as artifacts. In particular, we removed genes that con-
determined by scores from the two profile HMMs). tained fingers shorter than 10 amino acids. Summary
statistics of the resulting dataset is shown in Table 1.
Because of relatively high time complexity of the PPP
2.3 Algorithm complexity algorithm, alignment of genes with high number of fin-
The time complexity of the PPP algorithm on two gers takes a lot of time. For that reason, we prepared
motif arrays with O(n) motifs, each of length O(m) a subset of the complete dataset, omitting genes from
is O(n2 m6 ). There are O(n2 ) individual motif align- human chromosome 19 and their putative orthologs in
ments. The time needed to compute one such align- other genomes. These genes contain the highest num-
ment is O(Lx Ly L4 ), where Lx , Ly are the lengths of bers of repeating motifs (30 or more). Summary statis-
motifs and L is the number of columns in the pro- tics for this restricted dataset are shown in the Table 2.
file HMM. This follows from the observation that in
the recurrent step of individual motif alignment we Model parameters. The emission probabilities of the
fill 3 × L × L × Lx × Ly matrix, and time required to pair HMM used in our experiments were based on the
compute each cell is at most O(L2 ), the upper bound BLOSUM85 substitution matrix. This particular ma-
on the number of values considered in the recurrence. trix was chosen in order to compare our results to Mo-
Typically, the number of columns in the profile HMM tifAligner [11]. In particular, we used the probability
46 Peter Kováč et al.
Histogram of ppp.related[, 2] − (ppp.related[, 3] + ppp.related[, 4]) Histogram of ppp.unrelated[, 2] − (ppp.unrelated[, 3] + ppp.unrelated[, 4])

Number of Finger Motifs

80
Genome Genes Variants Total Average Median

80
hg19 323 510 5249 10.29 9

60
mm9 201 314 2818 8.97 8

Frequency

Frequency
canFam2 257 406 3710 9.14 8

40
rheMac2 305 484 4766 9.85 9

20
Table 2. The restricted dataset, omitting genes from hu-

0
man chromosome 19 and their orthologs. −40 −20 0 20 40 60 80 −40 −20 0 20 40 60 80

Score Score

Fig. 7. Score distributions in related (left) and random
distributions p and q from which the matrix was de- (right) datasets om PPP model.
Histogram of nowick.related[, 1] Histogram of nowick.unrelated[, 1]

rived, as supplied in EMBOSS software package [13].

30
The transition probability parameters (see Figure 4)

25
were set as follows: τ = 0.0345, so that the expected

20
length of an alignment is 28, which is the length of

60
Frequency

Frequency
a typical human C2 H2 zinc finger motif; δ = 0.05185

15
40
so that the expected length of a match region is 13.45,

10
because the most variable region of a zinc finger motif 20

5
spans positions 12-15; ε = 0.4769 so that the expected
0

0
length of a gap is 1.1. −50 0 50

Score
100 150 200 −50 0 50

Score
100 150 200

The complete parameter set of the profile model
was acquired from the Pfam database entry for the Fig. 8. The score distributions in related (left) and ran-
ZNF C2 H2 family [1]. The length of the profile is 23, dom (right) datasets, MotifAligner approach based on the
which is shorter than a typical human zinc finger mo- BLOSUM85 matrix.
tif. The reason is that the model is based on a more
diverse set of sequences from various species.
where the related set was treated as positives and the
random set as negative examples. The classification
3.1 The PPP score distribution performance of the PPP is clearly better, demonstrat-
ing that our scheme is more suitable as a score for
To compare the scoring function of the PPP model classification of paired motifs from random pairs.
with the scores used by MotifAligner, we created two
sets of zinc-finger motif pairs. The related set con-
3.2 Alignment accuracy
tained 1000 fingers from the human genome, each
paired with the corresponding finger from macaque, Next we use the PPP and the simpler MotifAligner
mouse or dog. The random set contained 1000 ran- method for scoring pairs of zinc fingers as building
dom pairs of fingers; we assume that these fingers are blocks in the whole motif array alignment. The
on average more distantly related to each other than Needleman-Wunsch algorithm for the whole motif ar-
paired fingers in the first set. ray alignment has three parameters: the gap opening
We have computed a PPP alignment of each se- penalty g, the gap extension penalty e, and the sub-
quence pair in both samples. The score distributions stitution matrix s that scores individual motif align-
are shown in Figure 7. Both distributions resemble the ments. For MotifAligner, we have used the original
normal distribution, with mean of the related set close parameters [11], in particular the BLOSUM85 substi-
to 20 and mean of the random set at around 5. tution matrix and the gap penalties set to g = 84 and
For comparison purposes, we have reimplemented e = 75.6. In the PPP model, the matrix s is deter-
MotifAligner algorithm as described in [11]. Figure 7 mined by the equation 3, and we have tested several
shows the score distributions of the MotifAligner ap- different settings of the parameters g and e.
proach to alignment of individual motifs, based on the We carried out three tests. In the first one, we
BLOSUM85 substitution matrix. These score distribu- aligned all zinc finger arrays of orthologous proteins in
tions do not resemble the normal distribution. In par- the complete dataset. The second and the third exper-
ticular, the distribution for the related set has a heavy iments simulated a loss of fingers during the evolution
tail, which is clearly not desirable. – we created two artificial datasets with 1/5 and 1/3
The important property of the scoring scheme is of the total number of fingers removed in each zinc
how well it is able to distinguish positive examples finger array in all four genomes, and we aligned the
from negative ones. Figure 9 shows a ROC curve, original human dataset with the four reduced sets.
Aligning sequences with repetitive motifs 47

the only existing program specifically designed to align
sequences with repetitive motifs.
There is still a room for improvement of our work.
Apart from obvious upgrades, like a more efficient im-
True Positive Rate

plementation, the underlying model can be enhanced
in several ways. For example, an interesting question is
whether some other scoring function of individual mo-
tif alignments would perform better. Such a function
might be based on different properties of the under-
PPP lying models, e.g. the full probability of a sequence,
MotifAligner
instead of the probability of the Viterbi path.
False Positive Rate To alleviate problems caused by the computational
complexity of the algorithm, various heuristics could
Fig. 9. ROC curve for related (positive) and random (neg- be applied, especially methods avoiding exhausting
ative) datasets. computations of the whole dynamic programming ma-
trix. In order to apply our model to other protein fami-
Aligned Misaligned lies with repeating motifs, a more robust procedure for
Dataset Program arrays motifs parameter estimation should be established. In addi-
Complete, Unchanged MotifAligner 2161 234 tion, a method for assessment of statistical significance
Complete, Unchanged PPP1 2161 178 of alignments may be helpful when computing align-
Complete, Unchanged PPP2 2161 331 ments of large datasets where random similarities are
Restricted, 1/5 Loss MotifAligner 1609 149 more likely to occur.
Restricted, 1/5 Loss PPP1 1609 139 The model we have implemented is not the only
Restricted, 1/5 Loss PPP2 1609 142 way of doing sequence alignment with repetitive mo-
Restricted, 1/3 Loss MotifAligner 1651 169 tifs. It is very appealing to use a monolithic proba-
Restricted, 1/3 Loss PPP1 1651 254 bilistic model instead of multiplying probabilities of
Restricted, 1/3 Loss PPP2 1651 252 three separate models. We have tried to develop such
Table 3. The comparison of MotifAligner and PPP model. a model, but we were not able to overcome some of
PPP1 refers to Needleman-Wunsch gap penalty parame- their intrinsic difficulties.
ters set to g = 30, e = 20 and PPP2 to g = 20, e = 10. From the practical point of view, the most serious
The third column lists the number of different zinc fin- problem we have encountered is the lack of reliable
ger array pairs aligned; fourth column lists the number of benchmark for assessing the accuracy of alignments
wrongly aligned motifs. with repetitive motifs. A high quality reference is very
valuable, because it allows exact evaluation of algo-
rithms and can give a clue where are the weak and
the strong parts of a particular method, or how to
In our tests, we achieved the best results when the set the method parameters to ensure optimal perfor-
gap opening penalty g was set to 30 and the gap ex- mance. We hope that our work will at least partially
tension penalty e to 20. The results of all tests are serve as a catalyst towards the creation of such a re-
shown in the Table 3. PPP1 was able to out- source.
perform the MotifAligner on the Unchanged and
1/5 Loss datasets. On the other hand, our model per-
formed slightly worse as the number of lost fingers was References
increased.
1. A. Bateman, S. Boehm, E. L. L. Sonnham-
mer, F. Gago: Mulitple sequence align-
ment of zinc finger C2H2 type family. Pfam
4 Conclusion Family: zf-C2H2 (PF00096), 2011. Online.
http://pfam.sanger.ac.uk/family/PF00096.
2. E. J. Bellefroid, D. A. Poncelet, P. J. Lecocq, O. Reve-
We have designed and implemented an algorithm for
lant, J. A. Martial: The evolutionarily conserved
alignment of sequences with repetitive motifs. The al-
Krppel-associated box domain defines a subfamily of
gorithm is built on top of two types of hidden Markov eukaryotic multifingered proteins. Proc. Natl. Acad.
models. It utilizes positional information from two Sci. U.S.A. 88(9), 1991, 3608–3612.
copies of a profile HMM and uses a pair HMM to align 3. G. Ding, P. Lorenz, M. Kreutzer, Y. Li, H. J. Thiesen:
the motif sequences. We were able to apply our model SysZNF: the C2H2 zinc finger gene database. Nucleic
on real world data, and obtained better results than Acids Res. 37 (Database issue), 2009, D267–273.
48 Peter Kováč et al.

4. R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological
sequence analysis. 1st edition. Cambridge University
Press. 1998, 356 p., ISBN: 978-0521629713.
5. S. R. Eddy: Accelerated profile HMM searches. PLoS
Comput. Biol. 7(10), 2011, e1002195.
6. I. Elias: Settling the intractability of multiple align-
ment. Journal of Computational Biology 13(7), 2006,
1323–1339.
7. A. T. Hamilton, S. Huntley, M. Tran-Gyamfi,
D. M. Baggott, L. Gordon, L. Stubbs: Evolutionary
expansion and divergence in the ZNF91 subfamily of
primate-specific zinc finger genes. Genome Res. 16(5),
2006, 584–594.
8. S. Henikoff, J. G. Henikoff: Amino acid substitution
matrices from protein blocks. Proc. Natl. Acad. Sci.
U.S.A., 89(22), 1992, 10915–10919.
9. S. Huntley, D. M. Baggott, A. T. Hamilton, M. Tran-
Gyamfi, S. Yang, J. Kim, L. Gordon, E. Branscomb,
L. Stubbs: A comprehensive catalog of human KRAB-
associated zinc finger genes: insights into the evolu-
tionary history of a large family of transcriptional re-
pressors. Genome Res., 16(5), 2006, 669–677.
10. S. B. Needleman, C. D. Wunsch: A general method ap-
plicable to the search for similarities in the amino acid
sequence of two proteins. J. Mol. Biol., 48(3), 1970,
443–453.
11. K. Nowick, C. Fields, T. Gernat, D. Caetano-Anolles,
N. Kholina, L. Stubbs: Gain, loss and divergence in
primate zinc-finger genes: a rich resource for evolution
of gene regulatory differences between species. PLoS
ONE 6(6), 2011, e21553.
12. K. Nowick, A. T. Hamilton, H. Zhang, L. Stubbs:
Rapid sequence and expression divergence suggest se-
lection for novel function in primate-specific KRAB-
ZNF genes. Molecular Biology and Evolution 27(11),
2010, 2606–2617.
13. P. Rice, I. Longden, A. Bleasby: EMBOSS: the Eu-
ropean molecular biology open software suite. Trends
Genet., 16(6), 2000, 276–277.
14. D. Schmidt, R. Durrett: Adaptive evolution drives the
diversification of zinc-finger binding domains. Mol.
Biol. Evol. 21(12), 2004, 2326–2339.
15. B. Schuster-Bockler, J. Schultz, S. Rahmann: HMM
Logos for visualization of protein families. BMC Bioin-
formatics 5, 2004, 7.
16. J. H. Thomas, R. O. Emerson: Evolution of C2H2-zinc
finger genes revisited. BMC Evol. Biol. 9, 2009, 51.
17. UCSC Human Genome Feb. 2009 (hg19,
GRCh37) Pairwise Alignments. Online.
http://hgdownload.cse.ucsc.edu/downloads.html.
18. R. Urrutia: KRAB-containing zinc-finger repressor
proteins. Genome Biology 4(10), 2003, 231.
19. A. J. Viterbi: Error bounds for convolutional codes and
an asymtotically optimum decoding algorithm. IEEE
Transactions on Information Theory IT-13, 1967, 260–
267.