<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>OB-Fold Recognition Combining Sequence and Structural Motifs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Macko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Králik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Broňa Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Vinař</string-name>
          <email>vinar@fmph.uniba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics, Physics and Informatics</institution>
          ,
          <institution>Comenius University in Bratislava</institution>
          ,
          <addr-line>Mlynská dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>18</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Remote protein homology detection is an important step towards understanding protein function in living organisms. The problem is notoriously difficult; distant homologs can often be detected only by a combination of sequence and structural features. We propose a new framework in which the user describes important sequence and structural features in the form of a descriptor, and the descriptor is then used to search a database of protein sequences and score potential candidates. We develop the algorithms necessary to support such a search using support vector machines and discrete optimization methods. We demonstrate our approach on the example of the telomere-binding OB-fold domain, showing that we can not only distinguish between Telo_bind family members and negatives, but also identify proteins from related protein families carrying similar OB-fold domains. A prototype implementation of the descriptor search software is available for the Linux operating system at http://compbio.fmph.uniba.sk/descal/</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Remote homology detection is a key to understanding the
role of individual proteins in living organisms. This
problem is notoriously difficult; the most commonly used tools
build profiles from groups of related proteins, representing
preferred amino acids at individual loci (e.g. [
        <xref ref-type="bibr" rid="ref1 ref24 ref3">1, 6, 24</xref>
        ]).
However, distant homologs are difficult to detect by
sequence alone, since the function of a protein is largely
determined by its 3D structure. Methods combining
structural and sequence-based elements can therefore achieve
higher sensitivity [
        <xref ref-type="bibr" rid="ref15 ref16 ref19 ref29">29, 19, 17, 4</xref>
        ].
      </p>
      <p>
        A similar problem is encountered in the search for RNA
genes, where considering RNA secondary structure is
essential to finding distant homologs of known genes. In
addition to fully automated systems for such tasks [
        <xref ref-type="bibr" rid="ref6">9</xref>
        ],
success was achieved by tools allowing expert human users
to handcraft motif descriptors representing the most
important features of the target RNAs [
        <xref ref-type="bibr" rid="ref17 ref21 ref22 ref28 ref5">8, 5, 22, 28, 21</xref>
        ].
Such descriptors specify restrictions on the base-pairing
structure of the target RNA (characterizing important
secondary structure features), as well as sequence constraints
in the form of regular expressions (characterizing
important conserved functional sites).
      </p>
      <p>In this work, we propose to extend such a
descriptor-based approach to protein homology search. However,
proteins do not have an equivalent of the simple
deterministic rules for RNA base-pairing, and sequence constraints
are more naturally written in the form of profiles rather
than simple regular expressions. For these reasons, our
approach combines techniques from machine learning
(support vector machines), probabilistic modeling (sequence
profiles), and manual selection of important structural
features.</p>
      <p>In particular, as the first step, the user creates a
descriptor characterizing the most important sequential and
structural features of a given protein or a protein family. In the
second step, we use our algorithm to score individual
proteins (e.g., all proteins in a particular organism) based on
how well the descriptor fits these proteins; the score
combines sequential features, secondary structure features, and
interactions between individual structural elements.
Finally, the candidate proteins can be ordered based on this
score and the highest scoring candidates will be
considered as homologs of the original protein.</p>
      <p>
        Consider the example of the telomere-binding OB-fold protein CDC13 in Saccharomyces cerevisiae. The important structural elements of this protein have been well characterized [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and are outlined in Fig. 1. The secondary structure of the telomere-binding OB-fold domain is composed of five β-strands and two α-helices. The β-strands form a typical β-barrel structure. Even though large sequence divergence is typical for this domain, several sequence sites are strongly conserved. To search for putative CDC13 homologs in various species, we propose to describe all these features in a single descriptor, as shown in Fig. 2. By screening a protein database and scoring individual proteins based on this descriptor, we can see that relevant homologs (those containing the telomere-binding domain) are scored the highest, proteins from related families have moderate scores, and unrelated proteins have generally low scores (Fig. 5). Thus, the highest scoring proteins are potential candidates for functional homologs of CDC13 in other species.
      </p>
      <p>The paper is organized as follows. First, we describe a
general framework of descriptors characterizing sequence
and structural features of a protein domain and illustrate
it on the telomere-binding OB-fold domain. An
important feature of these descriptors is the identification of
potentially bonded β-strands. We have developed a support
vector machine based classifier for this task. Next, we
describe two algorithms for descriptor search in protein
sequences. Finally, we evaluate our method on the example
of telomere-binding OB-fold proteins, as outlined in the
previous paragraph.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Methods</title>
      <sec id="sec-2-1">
        <title>2.1 Protein Domain Descriptors</title>
        <p>
          To search for occurrences of a known protein domain,
we propose to characterize the domain in the form of a
descriptor inspired by descriptors used in RNA structure
search [
          <xref ref-type="bibr" rid="ref21 ref7">10, 21</xref>
          ]. The main idea is to divide the whole
domain into segments corresponding to secondary structure
elements; these segments have fixed order along the
sequence. Each segment is characterized by the minimum
and maximum allowed length and the secondary structure
class (α-helix, β-strand, or coil). For each segment, it is
also possible to provide a short sequence motif in the form
of a position-specific scoring matrix (PSSM). An
important aspect is the ability to specify interactions between
distant segments of the protein. In our descriptor, we
allow specification of hydrogen bonds between individual
β-strand segments, which can be parallel or anti-parallel;
we also specify the minimum number of hydrogen bonds.
        </p>
        <p>Most constraints specified by the descriptor are soft; we
allow arbitrary consecutive placement of descriptor
segments on the query protein subject only to the length
constraints. Each such alignment of the descriptor to the
protein obtains a score according to the scoring scheme
described below, and the score of the protein is the score
obtained by the best alignment. All examined proteins are
then ranked by their scores, and the user can examine
selected proteins from the top of the list, or choose a suitable
cutoff score for protein classification.</p>
        <p>The scoring scheme consists of three components, which are combined into the overall score as a linear combination with suitable weights. The first component measures the agreement of the desired secondary structure elements with the predicted secondary structure of the query protein. The score s<sub>j</sub> of segment j placed at positions k, …, ℓ is the sum <inline-formula><tex-math>s_j = \sum_{i=k}^{\ell} \ln(p_i + 0.5)</tex-math></inline-formula>, where p<sub>i</sub> is the posterior probability of the desired secondary structure type at position i. A position with p<sub>i</sub> above 0.5 thus contributes a positive term, and a position with p<sub>i</sub> below 0.5 a negative one. In this way, we prefer alignments that agree with the predicted secondary structure, while at the same time we tolerate unavoidable errors in the secondary structure prediction.</p>
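        <p>The secondary structure component can be illustrated with a minimal sketch. The function name, the 0-based inclusive indexing, and the toy posterior values are assumptions made for this example only; in practice the posteriors would come from a secondary structure predictor such as PSIPRED.</p>
        <preformat><![CDATA[
```python
import math


def segment_ss_score(posteriors, k, l):
    """Sum of ln(p_i + 0.5) over positions k..l (0-based, inclusive).

    posteriors[i] is the posterior probability that position i has the
    secondary structure class required by the segment.
    """
    return sum(math.log(p + 0.5) for p in posteriors[k:l + 1])


# Toy posteriors (hypothetical values, not taken from any real protein).
strand_posteriors = [0.5, 0.9, 0.1, 0.5]
score = segment_ss_score(strand_posteriors, 0, 3)
```
]]></preformat>
        <p>In this toy case, positions with posterior 0.5 contribute nothing, the confident position adds to the score, and the position that disagrees with the descriptor subtracts from it.</p>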
        <p>
          Methods for estimating the posterior probability that each
position in the protein is an α-helix, β-strand,
or coil have been developed previously, and we use
PSIPRED [
          <xref ref-type="bibr" rid="ref10">13</xref>
          ] to estimate them.
        </p>
        <p>The second component of the score evaluates the
agreement of the sequence with the specified sequence motif
in each segment. The motif is given by a PSSM
containing a log-odds score for every amino acid at each position
within the motif. We use PSSMs extracted from strongly
conserved regions of Pfam profiles. In general, the motif
is shorter than the minimum segment length, and we use
the score of the best-scoring ungapped alignment of the
PSSM within the particular sequence segment.</p>
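        <p>A minimal sketch of the motif component follows. It assumes that the PSSM is stored as a list of per-column dictionaries mapping amino acids to log-odds values; this representation, the function name, and the two-letter toy alphabet are illustrative assumptions, not the format used by the prototype.</p>
        <preformat><![CDATA[
```python
def best_motif_score(segment, pssm):
    """Best ungapped alignment score of a PSSM within a sequence segment.

    segment: amino acid string covered by the descriptor segment.
    pssm: list of dicts, one per motif column, amino acid -> log-odds.
    Returns None if the segment is shorter than the motif.
    """
    width = len(pssm)
    if len(segment) < width:
        return None
    return max(
        sum(pssm[c][segment[start + c]] for c in range(width))
        for start in range(len(segment) - width + 1)
    )


# Toy two-column PSSM over a two-letter alphabet (hypothetical values).
toy_pssm = [{"A": 1.0, "G": -1.0}, {"A": -1.0, "G": 2.0}]
toy_best = best_motif_score("GAGA", toy_pssm)
```
]]></preformat>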
        <p>Finally, the third score component characterizes the propensity of two β-strands to form hydrogen bonds (we call such β-strands interacting). In the next section, we describe a sequence-based classifier that estimates whether two amino acids are likely to form a hydrogen bond in the context of two interacting β-strands; we denote the resulting score for positions i and j as bond(i, j). For a pair of segments required to interact through k parallel hydrogen bonds, we find positions i and j within the segments such that the score <inline-formula><tex-math>m_{i,j} = \sum_{\ell=0}^{k-1} \mathrm{bond}(i + 2\ell, j + 2\ell)</tex-math></inline-formula> is maximized. (We proceed similarly for anti-parallel interacting β-strand segments.) Note that the orientation of amino acids in a β-strand typically alternates, and therefore we skip one position between adjacent bonds. Even though a β-strand as a whole can interact with two other β-strands, we require that each amino acid is involved in at most one hydrogen bond.</p>
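        <p>The search for the best placement of k parallel hydrogen bonds can be sketched as a brute-force scan over start positions. The function below is only an illustration: the bond score is passed in as an arbitrary callable (standing in for the classifier output), segments are given as inclusive position ranges, and the toy bond function is hypothetical.</p>
        <preformat><![CDATA[
```python
def best_parallel_interaction(bond, seg_a, seg_b, k):
    """Maximize sum over l = 0..k-1 of bond(i + 2l, j + 2l).

    bond: callable (i, j) -> float, the pairwise bond score.
    seg_a, seg_b: (start, end) inclusive position ranges of two segments.
    Returns (score, i, j) for the best start positions, or None if the
    k bonds do not fit into both segments.
    """
    span = 2 * (k - 1)  # distance between the first and the last bond
    best = None
    for i in range(seg_a[0], seg_a[1] - span + 1):
        for j in range(seg_b[0], seg_b[1] - span + 1):
            score = sum(bond(i + 2 * l, j + 2 * l) for l in range(k))
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


def toy_bond(i, j):
    # Hypothetical score: rewards pairs exactly ten positions apart.
    return 1.0 if j - i == 10 else 0.0


toy_result = best_parallel_interaction(toy_bond, (0, 5), (10, 15), 3)
```
]]></preformat>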
        <p>
          Figures 1 and 2 show an example of a descriptor for the telomere-binding OB-fold domain. The descriptor contains ten segments, of which five are β-strands and one is an α-helix. Six of the segments contain sequence motifs corresponding to strongly conserved sections of the Pfam model for the Telo_bind domain. The descriptor also specifies anti-parallel interactions between β-strands forming a β-barrel.
        </p>
        <p>Fig. 2 (listing): descriptor for the telomere-binding OB-fold domain.</p>
        <preformat>
B1|C1|B2|C2|B3|A1|C3|B4|C4|B5
B1: 6 15 B1_LOGO
C1: 2 80
B2: 6 10 B2_LOGO
C2: 2 40 C2_LOGO
B3: 7 10 B3_LOGO
A1: 4 30
C3: 1 10
B4: 8 15 B4_LOGO
C4: 2 100 C4_LOGO
B5: 3 15
- B1 B2: 3
- B2 B3: 3
- B4 B5: 3
- B1 B4: 3
***********LOGO_DEFINITIONS*************
B1_LOGO: [5]
1.4737 0.5703 1.4760 1.1420 ...
1.5272 0.8311 1.2353 1.2990 ...
-0.6192 -0.2024 0.5459 -1.0728 ...
1.1771 0.8718 2.0237 -0.0035 ...
1.0375 0.8607 1.8582 1.3712 ...
B2_LOGO: [3]
-0.7001 -0.7778 -0.9399 0.4729 ...
0.4238 1.1190 0.6647 -0.3823 ...
-1.1356 -0.6612 -0.3739 -1.4812 ...
C2_LOGO: [6]
-1.2004 -1.2746 -1.0686 -1.5443 ...
...
B3_LOGO: [6]
1.4816 1.4929 0.9900 1.1481 ...
...
        </preformat>
        <p>
          An important part of the descriptor language is the ability to specify the pattern of hydrogen bonding between individual β-strands that are possibly quite distant in the primary sequence. The use of β-strand interactions has been shown to improve the accuracy of fold recognition in β-strand rich proteins [
          <xref ref-type="bibr" rid="ref15 ref16">17, 4</xref>
          ]. Several grammar-based methods for recognition of the hydrogen bond structure of β-sheets were proposed in the context of protein structure prediction [
          <xref ref-type="bibr" rid="ref13 ref14 ref27">16, 27, 3</xref>
          ]. Another approach to predicting the topology of β-sheets and interstrand β-residue pairings uses neural networks [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          We have created two classifiers to determine whether two putative β-strands are likely to form hydrogen bonds, one for parallel and one for anti-parallel strands. The input to each classifier consists of two sequence windows of length 5. The classifier estimates whether the middle amino acids in these windows are likely to form a hydrogen bond with each other. The classifier has the form of a support vector machine (SVM) [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. We first convert the two sequence windows to a numerical feature vector of length 201. Each of the ten sequence positions is represented by 20 binary features: the feature corresponding to the encoded amino acid is set to one, and the remaining 19 features are set to zero. This accounts for 200 of the features.
        </p>
        <p>The last feature is the log-odds score for the interaction of the two middle amino acids. In particular, by using our training set, we have estimated frequencies f<sub>a,b</sub> with which pairs of amino acids a and b occur among hydrogen bonds between interacting β-strands, and frequencies f<sub>x</sub> with which amino acids x occur in β-strands individually. If the two amino acids in the middle of the evaluated windows are a and b, then the log-odds score is defined as <inline-formula><tex-math>s_{a,b} = \log \frac{f_{a,b}}{f_a f_b}</tex-math></inline-formula>.</p>
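        <p>The construction of the 201-dimensional feature vector can be sketched as follows. The amino acid ordering, the use of the natural logarithm, the treatment of the pair (a, b) as ordered, and the toy frequency values are all assumptions of this example; the source does not specify them.</p>
        <preformat><![CDATA[
```python
import math

# Assumed ordering of the 20 standard amino acids (any fixed order works).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa):
    """20 binary features with a single one at the index of aa."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec


def log_odds(a, b, pair_freq, single_freq):
    """s_{a,b} = log(f_{a,b} / (f_a * f_b)); natural log assumed."""
    return math.log(pair_freq[(a, b)] / (single_freq[a] * single_freq[b]))


def encode(window1, window2, pair_freq, single_freq):
    """Two windows of length 5 -> 10 * 20 one-hot features + 1 log-odds."""
    assert len(window1) == 5 and len(window2) == 5
    features = []
    for aa in window1 + window2:
        features.extend(one_hot(aa))
    features.append(log_odds(window1[2], window2[2], pair_freq, single_freq))
    return features


# Hypothetical frequencies, for illustration only.
toy_pair_freq = {("V", "I"): 0.02}
toy_single_freq = {"V": 0.1, "I": 0.1}
vector = encode("AAVAA", "GGIGG", toy_pair_freq, toy_single_freq)
```
]]></preformat>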
        <p>
          To create a training set, we clustered sequences in the PDB database [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] into clusters at 90% sequence similarity using the CD-Hit software [
          <xref ref-type="bibr" rid="ref12">15</xref>
          ]. From each cluster, we selected only one sequence for further processing, thus obtaining a representative sample of the proteins in the database. (Without this preprocessing, we would oversample from a few large clusters of very similar proteins.)
        </p>
        <p>
          In addition to the sequence and the secondary structure annotations (all of which are contained in the PDB database), we also need to determine hydrogen bonds in the selected sequences. These were calculated using Jmol Viewer [
          <xref ref-type="bibr" rid="ref8">11</xref>
          ]. A positive sample is a pair of sequence windows of length 5 taken from two β-strands of the same protein that have their middle amino acids connected by a hydrogen bond and that have at least one other hydrogen bond between endpoints of the two windows. Parallel or anti-parallel orientation of the windows is determined based on this second bond. A negative sample is a pair of windows from two different β-strands in the same protein that are not connected by any hydrogen bond.
        </p>
        <p>We have selected a random subset of 160,000 positive
and 766,239 negative samples for the anti-parallel model
and 86,887 positive and 434,435 negative samples for the
parallel model. Testing sets for both models contained
15,000 positive and 75,000 negative samples and did not
overlap the training set.</p>
        <p>
          Using a small validation set that did not overlap the training or testing sets, we explored a variety of kernels for the SVM with the SVM-light software [
          <xref ref-type="bibr" rid="ref9">12</xref>
          ]. Fig. 3 shows the accuracy of the models with different SVM kernels. Our final choice was the polynomial kernel of degree 7.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3 Descriptor Alignment as an Integer Linear Program</title>
        <p>
          We will call the task of finding the best placement of
individual descriptor segments on the query protein
sequence the descriptor alignment problem. Since the
scoring scheme includes long-range interactions between
segments, and the positions of interacting segment pairs
within the descriptor are not constrained, this problem is
NP-hard, similarly to protein threading [
          <xref ref-type="bibr" rid="ref11">14</xref>
          ] and RNA
descriptor search with pseudoknots [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>Fig. 3 (plot not reproduced): accuracy of the SVM models with different kernels. Curves: antiparallel polynomial, parallel polynomial, antiparallel radial, parallel radial. Axis label: false positive rate.</p>
        <p>We therefore formulate the problem as an integer linear program (ILP) and use an existing ILP solver (CPLEX) to find the optimal solution. For simplicity, we show only the formulation for parallel interacting β-strands; the formulation for antiparallel strands is analogous. Our formulation uses the following binary variables:</p>
        <list list-type="bullet">
          <list-item><p>Variable x<sub>i,s</sub> indicates whether position i is covered by segment s.</p></list-item>
          <list-item><p>Variable m<sub>i,s</sub> indicates whether position i is the starting position of the motif alignment within segment s.</p></list-item>
          <list-item><p>Variable y<sub>i,s,j,t</sub> indicates whether positions i and j are the first in a chain of hydrogen bonds between (parallel) interacting segments s and t.</p></list-item>
          <list-item><p>Variable p<sub>i,s,t</sub> indicates whether position i is involved in a hydrogen bond between segments s and t.</p></list-item>
        </list>
        <p>We add hypothetical segments 0 and m + 1 that fill the gaps at the beginning and at the end of the sequence but do not contribute to the score. The goal of the optimization is to maximize the score of the alignment given by the variables:
          <disp-formula><tex-math>\sum_{i,s} \left( e_{i,s} x_{i,s} + f_{i,s} m_{i,s} \right) + \sum_{i,j,s,t} g_{i,s,j,t} \, y_{i,s,j,t}</tex-math></disp-formula>
          Coefficients e<sub>i,s</sub>, f<sub>i,s</sub>, and g<sub>i,s,j,t</sub> are precomputed according to the scoring function, including the weights of individual components. The optimization is subject to the linear constraints shown in Figure 4. Constraints P1-P3 ensure that all the segments are placed consecutively and in the correct order onto the protein sequence. The length of each segment is constrained by L1. The motif occurrences must be placed within their corresponding segments (constraints M1-M3). The hydrogen bonds between a pair of interacting segments s and t must also lie within these segments (constraints B1-B3). Finally, each amino acid can be involved in at most one hydrogen bond (constraints S1, S2).</p>
        <p>We have tested variants of this ILP formulation on several proteins. Short positive examples can typically be solved within minutes. However, the running time for negative examples was usually quite high, and therefore we were not able to complete more extensive tests with this approach. Instead, we propose a dynamic programming algorithm for a slightly simplified version of the problem, as described in the next section.</p>
        <p>Since the descriptor alignment problem is NP-hard for a general conformation of interacting segment pairs, we will solve a special case where the interactions within the descriptor are limited. In particular, if segments s<sub>1</sub> and s<sub>2</sub> interact, we do not allow any interactions for a segment s such that s<sub>1</sub> &lt; s &lt; s<sub>2</sub>. Interacting pairs form chains of the form (s<sub>1</sub>, s<sub>2</sub>), (s<sub>2</sub>, s<sub>3</sub>), …, (s<sub>k−1</sub>, s<sub>k</sub>), and different chains occupy disjoint regions of the descriptor. The descriptor in Fig. 2 does not satisfy this restriction because segments B2 and B3 interact, and they lie within the interacting pair (B1, B4). If we remove the pair (B1, B4), the remaining interactions satisfy the restriction and form two chains [(B1, B2), (B2, B3)] and [(B4, B5)].</p>
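        <p>The restriction on interacting pairs can be checked mechanically. The sketch below assumes that segments are identified by their 0-based order in the descriptor and that each interaction is given as a pair of such indices; the function name and this encoding are illustrative assumptions. The example uses the segment order of the descriptor in Fig. 2 (B1, C1, B2, C2, B3, A1, C3, B4, C4, B5).</p>
        <preformat><![CDATA[
```python
def violates_restriction(interactions):
    """True if some interacting pair (a, b) strictly encloses a segment
    that itself takes part in an interaction.

    interactions: list of (a, b) segment indices with a < b.
    """
    for a, b in interactions:
        for c, d in interactions:
            if a < c < b or a < d < b:
                return True
    return False


# Segment order from the descriptor: B1 C1 B2 C2 B3 A1 C3 B4 C4 B5.
ORDER = ["B1", "C1", "B2", "C2", "B3", "A1", "C3", "B4", "C4", "B5"]
INDEX = {name: pos for pos, name in enumerate(ORDER)}


def pairs(names):
    """Convert pairs of segment names into pairs of segment indices."""
    return [(INDEX[x], INDEX[y]) for x, y in names]


full = pairs([("B1", "B2"), ("B2", "B3"), ("B4", "B5"), ("B1", "B4")])
reduced = pairs([("B1", "B2"), ("B2", "B3"), ("B4", "B5")])
full_violates = violates_restriction(full)
reduced_violates = violates_restriction(reduced)
```
]]></preformat>
        <p>Under these assumptions, the full descriptor is rejected because B2 and B3 lie inside the pair (B1, B4), while the descriptor without (B1, B4) is accepted, in agreement with the text.</p>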
        <p>We will show that under this restriction, the alignment problem can be solved optimally in polynomial time by dynamic programming. First, we will consider two simpler problems. If there are no interactions in the descriptor, we can use straightforward dynamic programming as follows. Let A[s, t] be the score of the best alignment of the first s segments of the descriptor, where the last of them ends at sequence position t. This value can be computed using the values for the first s − 1 segments ending at positions t′ &lt; t:
          <disp-formula><tex-math>A[s, t] = \max_{f} \left( A[s-1, f-1] + S[s, f, t] \right)</tex-math></disp-formula>
          In this formula, S[s, f, t] is the segment score for segment s extending from position f to t. This score includes the secondary structure and sequence motif scores, combined with appropriate weights.</p>
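        <p>The interaction-free recurrence can be sketched as below. This is an illustration under stated assumptions rather than the prototype implementation: positions are 1-based, each segment is given by its minimum and maximum length, the segment score is supplied as a callable, and the unscored regions before the first and after the last segment are handled by initializing A[0][t] to zero and taking the maximum over all end positions.</p>
        <preformat><![CDATA[
```python
NEG_INF = float("-inf")


def align_without_interactions(n, segments, seg_score):
    """Best score of placing all segments consecutively on a sequence.

    n: sequence length (positions are 1..n).
    segments: list of (min_len, max_len), in descriptor order.
    seg_score(s, f, t): score of segment s (0-based) covering f..t.
    """
    m = len(segments)
    # A[s][t]: best score of the first s segments, the s-th ending at t.
    A = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    for t in range(n + 1):
        A[0][t] = 0.0  # assumption: the prefix before segment 1 is free
    for s in range(1, m + 1):
        lo, hi = segments[s - 1]
        for t in range(1, n + 1):
            for length in range(lo, hi + 1):
                f = t - length + 1  # start of segment s
                if f < 1:
                    break
                prev = A[s - 1][f - 1]
                if prev == NEG_INF:
                    continue
                cand = prev + seg_score(s - 1, f, t)
                if cand > A[s][t]:
                    A[s][t] = cand
    return max(A[m])  # assumption: the suffix after the last segment is free


# Toy per-position scores (hypothetical); a segment scores their sum.
toy_weights = [1.0, -1.0, 2.0, 2.0, -1.0, 1.0]


def toy_seg_score(s, f, t):
    return sum(toy_weights[f - 1:t])


toy_alignment_score = align_without_interactions(
    6, [(1, 2), (1, 2)], toy_seg_score
)
```
]]></preformat>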
        <p>We will now extend this dynamic programming to the case where we allow restricted interaction configurations as described above. However, we will not yet enforce the condition that a single amino acid can only be used in a single hydrogen bond. To accommodate interactions, we need to include the score for the best placement of hydrogen bonds between interacting segments. Let B[s, f′, t′, f, t] be the interaction score between segment s and its interaction partner s′ &lt; s if s extends from f to t and s′ extends from f′ to t′. This score is precomputed by finding the highest-scoring positions of hydrogen bonds within the two segments. In order to incorporate such interaction scores into our dynamic programming, we need to increase the dimension of matrix A to keep track of the position of s′.</p>
        <p>If segments s<sub>1</sub> and s<sub>2</sub> interact and s is a segment such that s<sub>1</sub> ≤ s &lt; s<sub>2</sub>, we say that s<sub>1</sub> is an open segment for s. Under our restriction on descriptors, each segment s has at most one open segment. We can define the subproblem of the dynamic programming A[s, t, f′, t′] as the score of the best alignment of the first s segments of the descriptor, where segment s ends at position t and the open segment for s starts at f′ and ends at t′. We compute this value from the values for s − 1, distinguishing the following four cases.</p>
        <p>In the first case, s interacts with two other segments s<sub>1</sub> and s<sub>2</sub>, where s<sub>1</sub> &lt; s &lt; s<sub>2</sub>. Then s is its own open segment, and therefore s starts at f′ and ends at t′ = t. We maximize over possible values f″ and t″ that represent the start and end of segment s<sub>1</sub> (which is the open segment for s − 1):
          <disp-formula><tex-math>\max_{f'', t''} \left( A[s-1, f'-1, f'', t''] + S[s, f', t'] + B[s, f'', t'', f', t'] \right)</tex-math></disp-formula>
        </p>
        <p>In the second case, s interacts with one segment s<sub>1</sub>, where s<sub>1</sub> &lt; s. Then s does not have an open segment, and we only consider values f′ = t′ = ⊥. We maximize over all possible values of f, f″, and t″, where f″ and t″ represent the start and end of segment s<sub>1</sub>, and f is the start of segment s:
          <disp-formula><tex-math>\max_{f'', t'', f} \left( A[s-1, f-1, f'', t''] + S[s, f, t] + B[s, f'', t'', f, t] \right)</tex-math></disp-formula>
        </p>
        <p>In the third case, s interacts with one segment s<sub>2</sub>, where s<sub>2</sub> &gt; s. Again, s is its own open segment, and therefore we require t = t′. On the other hand, s − 1 does not have an open segment, and thus we do not need to maximize over any values, obtaining the equation
          <disp-formula><tex-math>A[s, t, f', t'] = A[s-1, f'-1, \bot, \bot] + S[s, f', t]</tex-math></disp-formula>
        </p>
        <p>The last case occurs when s does not interact with any segment. It may have an open segment s′ &lt; s, which is then also the open segment for s − 1, or it does not have any open segment, which means that f′ = t′ = ⊥. We maximize over all possible starts f of segment s:
          <disp-formula><tex-math>\max_{f} \left( A[s-1, f-1, f', t'] + S[s, f, t] \right)</tex-math></disp-formula>
        </p>
        <p>Finally, we further extend our algorithm to enforce that each amino acid is involved in at most one hydrogen bond. Let (s<sub>1</sub>, s<sub>2</sub>) and (s<sub>2</sub>, s<sub>3</sub>) be two interacting pairs of segments sharing segment s<sub>2</sub>. When choosing bond positions for (s<sub>2</sub>, s<sub>3</sub>), we need to know which positions were already used for bonds in (s<sub>1</sub>, s<sub>2</sub>) and thus cannot be used again. To do this, we introduce a new parameter b′ into our table A[s, t, f′, t′, b′]. Parameter b′ is the position of the first hydrogen bond within the open segment s<sub>1</sub> of segment s. Other positions in s<sub>1</sub> used by bonds can be determined based on the required number of bonds and the orientation specified in the descriptor. Computation of values in table A needs to distinguish six cases depending on the type of segment s, similarly as before. The two extra cases arise from the need to keep track of whether the open segment for segment s − 1 has restricted positions or not. We omit the full recurrence, which can be derived by carefully extending the formulas above.</p>
        <p>The running time of the algorithm is O(nmfℓ<sup>4</sup>), where n is the length of the protein sequence, m is the number of segments in the descriptor, ℓ is the maximum segment length, and f is the maximum flexibility of interacting pairs, defined as the difference between the smallest and the largest possible distance between their ends t and t′.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.5 Implementation Details</title>
        <p>To compute interaction scores, we need to run the SVM predictor for all pairs of windows of length 5 that can be covered by interacting β-strands. In the worst case, the number of such pairs grows quadratically with the protein length, although in practice the flexibility of the descriptor is limited, thus bounding the achievable distance of these pairs. Nonetheless, computation of all required SVM values was the most time-consuming part of the dynamic programming solution. Therefore, we have added a heuristic rule that allows hydrogen bonds only between amino acids whose posterior probability of β-strand secondary structure from PSIPRED is at least 0.5. This rule has dramatically lowered the computation time. As we will see in the next section, many negative examples do not have enough potential placements for hydrogen bonds, and as a result, no alignment of the descriptor is possible. On the other hand, this situation happens only very rarely for positive examples.</p>
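        <p>The filtering heuristic can be sketched as follows. The function names, the 0-based positions, and the toy posteriors are assumptions of this example; only the threshold of 0.5 comes from the text. The sketch enumerates the position pairs for which the SVM would still have to be evaluated.</p>
        <preformat><![CDATA[
```python
def bond_eligible_positions(strand_posteriors, threshold=0.5):
    """Positions whose beta-strand posterior is at least the threshold."""
    return {i for i, p in enumerate(strand_posteriors) if p >= threshold}


def candidate_pairs(eligible, seg_a, seg_b):
    """Pairs (i, j) with i in seg_a, j in seg_b, both positions eligible.

    seg_a, seg_b: (start, end) inclusive position ranges.
    """
    return [
        (i, j)
        for i in range(seg_a[0], seg_a[1] + 1) if i in eligible
        for j in range(seg_b[0], seg_b[1] + 1) if j in eligible
    ]


# Toy beta-strand posteriors (hypothetical values).
toy_posteriors = [0.9, 0.4, 0.5, 0.1, 0.8, 0.7]
toy_eligible = bond_eligible_positions(toy_posteriors)
toy_pairs = candidate_pairs(toy_eligible, (0, 2), (3, 5))
```
]]></preformat>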
        <p>Our scoring scheme allows us to assign different
weights to the three components. Ideally, these weights
would be optimized to improve the prediction accuracy.
For simplicity, we have used weight 1 for secondary
structure and sequence motifs and weight 2 for interactions.
The weight of interactions was increased because values
produced by the SVM were relatively small compared to
the overall score.</p>
        <p>In order to comply with the constraints imposed by the dynamic programming, we omit the interaction between segments B1 and B4 from the descriptor of the OB-fold protein domain shown in Fig. 2. After computing the solution for the reduced descriptor with the dynamic programming, we simply try every possible position for the B1-B4 interaction and include the best one in the overall score.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>To evaluate our descriptor approach, we have used the
descriptor in Fig. 2 and our dynamic programming algorithm
to recognize proteins containing the telomere-binding
OB-fold domain. Note that in the dynamic programming, we
omit some of the interacting pairs to make the problem
tractable. Even though part of the score corresponding to
missing interactions is later added to the final score, the
alignment of the descriptor obtained by the dynamic
programming may not be optimal.</p>
      <p>We have randomly selected 50 proteins with the telomere-binding OB-fold domain annotated in Pfam (Pfam domain PF02765 Telo_bind). We have also randomly chosen 50 SWISS-PROT proteins not associated with the PF02765 family as a negative sample. Finally, we have selected four other families from the OB-fold clan: RNA polymerase Rpb8 family (PF03870 Rpb8), single-strand binding protein family (PF00436 SSB), tRNA binding domain (PF01588 tRNA_bind), and eukaryotic elongation factor 5A hypusine (PF01287 eIF-5a). From each of these four families, we have randomly chosen 20 proteins.</p>
      <p>
        The results are summarized in Fig. 5. Our descriptor can reliably distinguish Telo_bind proteins from the negative samples. Only three negative samples had a score higher than 20, with the largest score being 38.9. Two positives scored less than 35, and two additional proteins were filtered out in the secondary-structure filtering. HMMer [
        <xref ref-type="bibr" rid="ref4">7</xref>
        ] achieved perfect separation between Telo_bind and negatives (Fig. 5c), but this comparison is not fair, since the annotation of protein domains in Pfam is based on the same profile HMM that was used in this test; it is therefore not surprising that it achieves perfect classification.
      </p>
      <p>Our goal is, however, to search for distant homologs that
cannot be reliably recognized by Pfam profiles. The
descriptor search is able to recognize proteins from related
families that also contain OB-folds, and yet at the same
time, it is possible to distinguish them quite successfully
from Telo_bind proteins. On the other hand, HMMer
results cannot distinguish these four additional families from
negatives (compare Fig. 5b and c). These results suggest
that it is sensible to use the descriptor search to locate
distant homologs, examining the resulting candidates in order
of the assigned scores. This can be especially beneficial
when sequence-based methods (such as HMMer) fail to
find any matches.</p>
      <p>
        Pot1 and CDC13 are OB-fold telomere-binding proteins that bind the single-stranded telomere overhang and are key players in telomere maintenance. It has been a long-standing question which protein performs this crucial role in the pathogenic yeast Candida albicans and related species. No homolog could be found by common sequence-based methods [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Yu et al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] have demonstrated that a short protein (UniProt ID Q5AB98) associates with telomere DNA and regulates telomere lengths. They postulate that this is the missing ortholog of CDC13. A search with our descriptor against this protein produced a score of 32.8, which is at the low end of the range for Telo_bind and well within the range of other OB-fold containing families. Note that a search for Pfam domains in this protein does not return any significant matches.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>In this paper, we have introduced a framework for
combining sequence and structural information in the search for
distant protein homologs. Important sequence and
structural features of a given protein or a protein family are
first manually selected and described in the form of a
descriptor which is then used to search a database of protein
sequences and score potential candidates.</p>
      <p>
        We have demonstrated the use of our framework on
the telomere-binding OB-fold domain. Based on the
description of the S. cerevisiae CDC13 protein by
Mitton-Fry et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], we have created a descriptor that includes
information on the secondary structure elements,
interactions of individual β-strands, and highly conserved
sequence motifs. We have developed an algorithm that
allowed us to score individual proteins with this
descriptor, and we have demonstrated that the descriptor
search not only distinguished Telo_bind family
members from negative examples, but also identified
proteins from related families containing similar domains.
      </p>
      <p>There are many avenues for further research in this area.
First, even though our algorithms are universal, we have
mostly targeted the features required to support the search
for distant CDC13 homologs. There are many other features
that could be included within the same algorithmic
framework (e.g., more flexible sequence motifs, irregular
hydrogen bond configurations, flexible distances between
individual elements), while others would require development
of new algorithms (e.g., more complex interaction models
between segments).</p>
      <p>The experience from a similar RNA search framework
suggests that writing sensitive and specific descriptors is
a long iterative process. Our Telo_bind descriptor is only
a first attempt at this task; further examination of the results
could suggest which features are less important
and could be omitted, and which new features should be
included instead. Continuing this work could lead to the
discovery of telomere-binding OB-fold proteins in species
where these proteins are still unknown, and also to a better
understanding of the importance of individual features of this
protein. Development of additional tools supporting such
research would be of great interest.</p>
      <p>The scoring function of the descriptor alignment to a
protein is a linear combination of several components. The
overall score is optimized globally; however, the weights
controlling individual contributions of the components
were chosen ad hoc. Systematic choice of these constants,
perhaps through machine learning methods, could lead to
higher accuracy.</p>
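      <p>As an illustration of what a systematic choice of weights might look like, the following is a minimal sketch and not the method used in this paper. The three component names, the weight grid, and the pairwise ranking criterion are all assumptions introduced for the example; a plain grid search stands in for the machine learning methods mentioned above.</p>
      <preformat>
```python
from itertools import product

# Hypothetical component names (assumption): this section does not list
# the actual components of the scoring function.
COMPONENTS = ("secondary_structure", "strand_pairing", "sequence_motif")


def combined_score(parts, weights):
    """Linear combination: sum over components of weight * component score."""
    return sum(weights[name] * parts[name] for name in COMPONENTS)


def pairwise_accuracy(positives, negatives, weights):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = sum(
        combined_score(p, weights) > combined_score(n, weights)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def grid_search(positives, negatives, grid=(0.0, 0.5, 1.0)):
    """Try every weight combination on the grid and keep the first best one."""
    best_weights, best_acc = None, -1.0
    for values in product(grid, repeat=len(COMPONENTS)):
        weights = dict(zip(COMPONENTS, values))
        acc = pairwise_accuracy(positives, negatives, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc
```
      </preformat>
      <p>Here each protein is represented by a dictionary of already computed component scores. In practice the weights would have to be selected on held-out data, and the grid would likely be replaced by a proper learning procedure.</p>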
      <p>[Figure panels: (a) Descriptor scores; (b) Descriptor ranks; (c) Pfam HMMer ranks.]</p>
      <p>One obstacle to a wider deployment of our current
search tool is its running time. The alignment of the
descriptor to a single protein takes anywhere from a
couple of seconds to several hours. In the dynamic
programming, we have sacrificed information provided through
one of the β-strand interactions, and we have further
restricted the search space by discarding segment positions
that did not match the secondary structure constraints well.
Yet, we believe that these relaxations changed the final
result very little. Further heuristic relaxations and
approximations could perhaps lead to faster search tools.</p>
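      <p>The idea of restricting the search space by discarding poorly matching segment positions can be sketched as follows. This is only an illustration under stated assumptions: the three-state predicted secondary structure string (E, H, C), the function name candidate_starts, and the min_fraction threshold are hypothetical and do not reproduce the criterion used in our implementation.</p>
      <preformat>
```python
def candidate_starts(predicted_ss, element_type, length, min_fraction=0.6):
    """Return start positions whose window agrees with the required element.

    predicted_ss  -- per-residue predicted states, e.g. "CCEEEECC" (assumption)
    element_type  -- state required by the descriptor element, e.g. "E"
    length        -- length of the descriptor element in residues
    min_fraction  -- hypothetical cutoff on the fraction of agreeing residues
    """
    kept = []
    for start in range(len(predicted_ss) - length + 1):
        window = predicted_ss[start:start + length]
        matches = sum(1 for state in window if state == element_type)
        if matches / length >= min_fraction:
            kept.append(start)
    return kept
```
      </preformat>
      <p>Only the retained start positions would then be passed to the dynamic programming, so a stricter threshold trades sensitivity for speed.</p>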
      <p>Finally, one could imagine that efforts towards
assembling a database of descriptors characterizing common
protein functions could lead to a better and faster
functional annotation of newly sequenced species.</p>
      <p>Acknowledgements. This research was funded by APVV
grant APVV-14-0253 and VEGA grants 1/0719/14 (TV)
and 1/0684/16 (BB).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Altschul</surname>
            ,
            <given-names>S. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>T. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaffer</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lipman</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
          .
          <source>Nucleic Acids Res</source>
          ,
          <volume>25</volume>
          (
          <issue>17</issue>
          ),
          <fpage>3389</fpage>
          -
          <lpage>3402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Cheng, J. and
          <string-name>
            <surname>Baldi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Three-stage prediction of protein β-sheets by neural networks, alignments and graph algorithms</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>21</volume>
          (
          <issue>suppl 1</issue>
          ),
          <fpage>i75</fpage>
          -
          <lpage>i84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>A new generation of homology search tools based on probabilistic inference</article-title>
          .
          <source>Genome Inform</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <fpage>205</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Accelerated Profile HMM Searches</article-title>
          .
          <source>PLoS Comput Biol</source>
          ,
          <volume>7</volume>
          (
          <issue>10</issue>
          ),
          <elocation-id>e1002195</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Gautheret</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Major</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cedergren</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1990</year>
          ).
          <article-title>Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA</article-title>
          .
          <source>Comput Appl Biosci</source>
          ,
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <fpage>325</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Griffiths-Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bateman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khanna</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Rfam: an RNA family database</article-title>
          .
          <source>Nucleic Acids Res</source>
          ,
          <volume>31</volume>
          (
          <issue>1</issue>
          ),
          <fpage>439</fpage>
          -
          <lpage>441</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jimenez</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rampasek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brejova</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinar</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Luptak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Discovery of RNA motifs using a computational pipeline that allows insertions in paired regions and filtering of candidate sequences</article-title>
          .
          <source>Methods Mol Biol</source>
          ,
          <volume>848</volume>
          ,
          <fpage>145</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jmol</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Jmol: an open-source Java viewer for chemical structures in 3D</article-title>
          . www.jmol.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Making large-scale SVM learning practical</article-title>
          . In B. Schölkopf, C. Burges, and A. Smola, editors,
          <source>Advances in Kernel Methods - Support Vector Learning</source>
          . MIT-Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>D. T.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Protein secondary structure prediction based on position-specific scoring matrices</article-title>
          .
          <source>J Mol Biol</source>
          ,
          <volume>292</volume>
          (
          <issue>2</issue>
          ),
          <fpage>195</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lathrop</surname>
            ,
            <given-names>R. H.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>The protein threading problem with sequence amino acid interaction preferences is NP-complete</article-title>
          .
          <source>Protein Eng</source>
          ,
          <volume>7</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1059</fpage>
          -
          <lpage>1068</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Godzik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>22</volume>
          (
          <issue>13</issue>
          ),
          <fpage>1658</fpage>
          -
          <lpage>1659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Mamitsuka</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Abe</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Predicting location and structure of beta-sheet regions using stochastic tree grammars</article-title>
          .
          <source>ISMB-94</source>
          , pages
          <fpage>276</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Searls</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Grammatical representations of macromolecular structure</article-title>
          .
          <source>Journal of Computational Biology</source>
          ,
          <volume>13</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1077</fpage>
          -
          <lpage>1100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Menke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cowen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Markov random fields reveal an N-terminal double beta-propeller motif as part of a bacterial hybrid two-component sensor system</article-title>
          .
          <source>Proc Natl Acad Sci U S A</source>
          ,
          <volume>107</volume>
          (
          <issue>9</issue>
          ),
          <fpage>4069</fpage>
          -
          <lpage>4074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Daniels</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hosur</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cowen</surname>
            ,
            <given-names>L. J.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone</article-title>
          .
          <source>Bioinformatics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>RNABob: a program to search for RNA secondary structure motifs in sequence databases</article-title>
          .
          <source>unpublished.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Mitton-Fry</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobald</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glustrom</surname>
            ,
            <given-names>L. W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wuttke</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Structural basis for telomeric single-stranded DNA recognition by yeast Cdc13</article-title>
          .
          <source>J Mol Biol</source>
          ,
          <volume>338</volume>
          (
          <issue>2</issue>
          ),
          <fpage>241</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lundegaard</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lund</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>T. N.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>CPHmodels-3.0 - remote homology modeling using structure-guided sequence profiles</article-title>
          .
          <source>Nucleic Acids Res</source>
          ,
          <volume>38</volume>
          (
          <issue>Web Server issue</issue>
          ),
          <fpage>W576</fpage>
          -
          <lpage>581</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Rampasek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>RNA structural motif search is NP-complete</article-title>
          .
          <source>In Studentska vedecka konferencia FMFI UK</source>
          , Bratislava, pages
          <fpage>341</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Rampasek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luptak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinar</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Brejova</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>RNA motif search with data-driven element ordering</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ),
          <fpage>216</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Reeder</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reeder</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Giegerich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Locomotif: from graphical motif description to RNA motif search</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>23</volume>
          (
          <issue>13</issue>
          ),
          <fpage>i392</fpage>
          -
          <lpage>400</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>P. W.</given-names>
          </string-name>
          et al. (
          <year>2011</year>
          ).
          <article-title>The RCSB Protein Data Bank: redesigned web site and web services</article-title>
          .
          <source>Nucleic Acids Res</source>
          ,
          <volume>39</volume>
          (Database issue),
          <fpage>D392</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Sadreyev</surname>
            ,
            <given-names>R. I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Grishin</surname>
            ,
            <given-names>N. V.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Accurate statistical model of comparison between multiple sequence alignments</article-title>
          .
          <source>Nucleic Acids Res</source>
          ,
          <volume>36</volume>
          (
          <issue>7</issue>
          ),
          <fpage>2240</fpage>
          -
          <lpage>2248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Teixeira</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gilson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Telomere maintenance, function and evolution: the yeast paradigm</article-title>
          .
          <source>Chromosome Res</source>
          ,
          <volume>13</volume>
          (
          <issue>5</issue>
          ),
          <fpage>535</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V. N.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>The Nature of Statistical Learning Theory</article-title>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Waldispühl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clote</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Steyaert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Predicting transmembrane β-barrels and interstrand residue interactions from sequence</article-title>
          .
          <source>PROTEINS: Structure, Function, and Bioinformatics</source>
          ,
          <volume>65</volume>
          (
          <issue>1</issue>
          ),
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>C.-H. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riccitelli</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruminski</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Luptak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Widespread occurrence of self-cleaving ribozymes</article-title>
          .
          <source>Science</source>
          ,
          <volume>326</volume>
          (
          <issue>5955</issue>
          ),
          <fpage>953</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>RAPTOR: optimal protein threading by linear programming</article-title>
          .
          <source>J Bioinform Comput Biol</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>95</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>E. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lue</surname>
            ,
            <given-names>N. F.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Analyses of Candida Cdc13 orthologues revealed a novel OB fold dimer arrangement, dimerization-assisted DNA binding, and substantial structural differences between Cdc13 and RPA70</article-title>
          .
          <source>Mol Cell Biol</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>186</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>