<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Neuropeptide Recognition by Machine Learning Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrej Ridzik</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Broňa Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics, Physics, and Informatics</institution>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynská dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Informatics of the Slovak Academy of Sciences</institution>
          ,
          <addr-line>Dúbravská cesta 9, 845 07 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>1214</volume>
      <fpage>72</fpage>
      <lpage>78</lpage>
      <abstract>
        <p>Thanks to advances in DNA sequencing and bioinformatics methods, for many species we know their genomic sequence as well as sequences of proteins produced by their cells. However, the exact function of those proteins is often unknown. The goal of our work is to automatically recognize neuropeptides, special proteins involved in communication between neurons. Neuropeptides are created in a cell from longer protein precursors, and our goal is to determine, if a given protein is a likely precursor and if yes, which of its parts will serve as neuropeptides. Existing methods solve only partial aspects of this problem. Our more comprehensive system uses a combination of support vector machines and conditional random fields.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper, we study the problem of neuropeptide
annotation. Recent advances in DNA sequencing allow us to
obtain DNA sequences of many species. We can then use
computational methods to find likely locations of genes
encoding proteins produced by the organism in question
(Yandell and Ence, 2012). The result is a list of
putative proteins, each characterized by its sequence of amino
acids, which can be represented in a computer as a string
over a 20-letter alphabet. However, the exact function of
these proteins is often unknown.</p>
      <p>
        The goal of our work is to automatically recognize
neuropeptides, short proteins acting as neurotransmitters in
communication between neurons and as hormones in
endocrine regulation
        <xref ref-type="bibr" rid="ref7">(Hook et al., 2008)</xref>
        . Neuropeptides are
created in cells from longer protein precursors, and our
goal is to determine whether a given protein is a likely precursor
and, if so, which of its parts will serve as neuropeptides.
      </p>
      <p>
        Several similar methods were already developed, but
each considers only partial aspects of the problem
        <xref ref-type="bibr" rid="ref11 ref14 ref15 ref16">(Southey et al., 2006, 2008b; Ofer and Linial, 2014)</xref>
        . We
will review them, together with further biological
background related to the problem, in the next section.
      </p>
      <p>
        Our system aims to solve the full problem including
recognition of precursors and identification of likely
neuropeptides. For this purpose, we use semi-Markov
conditional random fields (semi-CRF). This is a probabilistic
model for annotating sequences, which generalizes hidden
Markov models (HMMs). Both HMMs and semi-CRFs
were previously used for several other problems in
bioinformatics
        <xref ref-type="bibr" rid="ref2 ref6 ref9">(Burge and Karlin, 1997; Krogh et al., 2001;
DeCaprio et al., 2007)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>As outlined above, a protein can be represented as a string
over a 20-letter alphabet, where each letter represents one
of the 20 amino acids commonly occurring in proteins
produced by living cells. Shorter chains of amino acids are
often called peptides, rather than proteins.</p>
      <p>
        The process of neuropeptide synthesis in a cell is
illustrated in Figure 1. First, a full-length precursor is
synthesized by the standard gene expression machinery of the
cell. This precursor starts with a special short sequence
called a signal peptide, which targets the synthesized
protein to the endoplasmic reticulum. In this organelle, the
signal peptide is cleaved away, resulting in a shorter
protein, which is transported to the Golgi apparatus. Proteins
called convertases then cut the precursor to several shorter
peptides. Some of the peptides resulting from this
cleavage serve as neuropeptides, and are often further modified
by removal of amino acids or attachment of other chemical
groups. The remaining peptides resulting from the
cleavage are destroyed by the cell; these are called propeptides.
Convertases cleave the precursor at specific positions. For
example, furin convertase cleaves after a peptide sequence
of length 4 conforming to the consensus RX[KR]R, where
R stands for arginine amino acid, K for lysine, [KR] for
either of these two, and X for any amino acid. However, not
every occurrence of this consensus sequence is cleaved and
the exact rule for recognizing cleavage sites is not known.
More details of this process can be found for example in
a review by
        <xref ref-type="bibr" rid="ref7">Hook et al. (2008)</xref>
        .
      </p>
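      <p>The consensus rule above can be illustrated with a short pattern search. This is only a sketch: the function name and the example sequence are hypothetical, and, as the text notes, a match to the consensus marks a candidate site rather than a confirmed cleavage.</p>
      <preformat><![CDATA[
```python
import re

# Consensus R-X-[KR]-R from the text. The lookahead lets overlapping
# occurrences be reported as separate candidates.
FURIN_CONSENSUS = re.compile(r"(?=R.[KR]R)")


def candidate_furin_sites(protein):
    """Return 1-based positions of the last motif residue, i.e. the residue
    after which cleavage could occur according to the consensus."""
    return [match.start() + 4 for match in FURIN_CONSENSUS.finditer(protein)]


if __name__ == "__main__":
    # Hypothetical sequence, used only to exercise the function.
    print(candidate_furin_sites("MKTARAKRGGSRSRRAA"))
```
]]></preformat>
      <p>Because not every occurrence of the consensus is cleaved, such a scan can only propose candidates that a statistical classifier then has to accept or reject.</p>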
      <p>We will distinguish three problems related to finding
neuropeptides among all proteins synthesized by the cells
of a given species.</p>
      <p>• Neuropeptide precursor recognition. This is a
classification problem: the input is a protein sequence,
and the goal is to decide whether this protein is
a likely precursor of neuropeptides.</p>
      <p>• Prediction of precursor cleavage sites. Here the input
is a neuropeptide precursor, and the goal is to
determine the positions where convertases will cleave the
protein. It can also be formulated as a decision
problem, in which we decide for each position in
the precursor whether it is a likely cleavage site, based on
a sequence window surrounding this position.</p>
      <p>• Neuropeptide annotation. This is the most general
problem. The input might be a neuropeptide
precursor or a different type of protein. The goal is to
decide whether it is a likely neuropeptide precursor and, if
so, to find which portions are likely to serve as
individual neuropeptides. This problem encompasses both
previous problems and adds further information,
because neither of the two previous problems
classifies individual products of cleavage as propeptides
or neuropeptides.</p>
      <p>[Figure 1: Neuropeptide synthesis. Stages shown: precursor, removal of signal peptide, cleavage by convertases, post-translational modifications.]</p>
      <p>
        Prediction of precursor cleavage sites is traditionally
addressed by a web-based tool called NeuroPred
        <xref ref-type="bibr" rid="ref14 ref15 ref16">(Southey
et al., 2006, 2008a)</xref>
        . NeuroPred uses both handcrafted
consensus motifs as well as methods of machine
learning. These methods (including logistic regression,
neural networks, and the k-nearest neighbors algorithm) were
trained on a set containing both positive and negative
examples from well-studied proteins. Each classification
method considers a short region of the sequence and
decides if it is a likely cleavage site or not.
      </p>
      <p>
        Authors of NeuroPred also developed a pipeline for
identification of likely precursors
        <xref ref-type="bibr" rid="ref15 ref16">(Southey et al., 2008b)</xref>
        .
However, this pipeline can only find precursors whose
sequence is similar to that of an already known
precursor from a related species. It is therefore not useful for
finding completely overlooked precursors, or
precursors from species with no closely related species
in which neuropeptides have already been identified.
      </p>
      <p>
        Recently,
        <xref ref-type="bibr" rid="ref11">Ofer and Linial (2014)</xref>
        have created a tool
called NeuroPID which identifies likely neuropeptide
precursors based solely on statistical information about the
protein sequence, such as distribution of amino acids in
short sequence windows, sequence entropy and others.
They define a large number of such features and use known
precursors as well as negative examples to train several
well-known classification methods, such as support vector
machines and random forests.
      </p>
      <p>In our work, we focus on the general neuropeptide
annotation problem, which was not addressed by existing
approaches. We attempt to recognize neuropeptides and their
precursors based solely on sequence features, without
requiring sequence similarity to known neuropeptides.
</p>
    </sec>
    <sec id="sec-3">
      <title>Overview of Our System</title>
      <p>The goal of our system is to assign to each amino acid
of the input protein a label characterizing its role. We
will use the following labels in neuropeptide precursors:
S for signal peptides, N for neuropeptides, P for discarded
propeptides, and C for cleavage positions. In negative
examples (proteins that are not neuropeptide precursors),
we use label P̄, but if these proteins also undergo
cleavage into smaller peptides, we mark the cleavage sites by
C̄ and the resulting peptides by X. The result of
annotation is thus a string over the alphabet of possible
labels Σ = {S, N, P, C, X, C̄, X̄} of the same length as the
input protein, where each label describes a putative role of
one amino acid. Neuropeptide precursors are marked by
the presence of regions labeled N, and these regions are
exactly the peptides predicted to serve as neuropeptides.
Individual neuropeptides are separated by an amino acid
marked C, or even by longer regions marked P. The
output sequence of labels is called an annotation of the input
protein.</p>
      <p>
        The main component of our system is a semi-Markov
conditional random field. It uses several features based on
the input sequence, which are computed directly or using
external programs. One of these external programs,
SignalP, is an existing and frequently used software for
recognizing signal peptides based on artificial neural networks
        <xref ref-type="bibr" rid="ref1">(Bendtsen et al., 2004)</xref>
        . The second external program is
a classifier based on support vector machines (SVM) for
recognizing cleavage sites, which we built ourselves. Its
goal is to replace NeuroPred in our analysis, because
NeuroPred is available only as a web-based service and as such
is harder to integrate into a stand-alone tool. The overall
architecture of our system is illustrated in Figure 2.
      </p>
      <p>[Figure 2: Architecture of the system. Components shown: input sequence, SignalP, SVM, CRF.]</p>
    </sec>
    <sec id="sec-4">
      <title>An SVM for Cleavage Site Prediction</title>
      <p>
        To recognize cleavage sites in a protein sequence, we use
one of the standard machine learning methods for
classification, support vector machines
        <xref ref-type="bibr" rid="ref5">(Cortes and Vapnik,
1995)</xref>
        . Our approach is similar to that of NeuroPred
        <xref ref-type="bibr" rid="ref14">(Southey
et al., 2006)</xref>
        . The input to the SVM is a binary vector
representing a window of size 2k + 1 selected from the
protein sequence. Each amino acid in the sequence
window is encoded as a binary vector of length 20 in which
one position is set to 1 and others to 0. The position of
the value 1 within the vector indicates the identity of the
amino acid. The whole window of length 2k + 1 thus
requires 20(2k + 1) binary variables. The goal of the SVM is
to decide if the protein will be cleaved after the amino acid
located in the middle of the window. We only consider
windows that have amino acids lysine or arginine before
the cleavage site, because cleavage sites seem to obey this
minimal requirement.
      </p>
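      <p>The window encoding described above can be sketched as follows. The ordering of the 20 amino acids, the function names, and the all-zero encoding of positions that fall outside the sequence are assumptions made for this illustration; the text does not specify how windows near the ends of a protein are handled.</p>
      <preformat><![CDATA[
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; the order is arbitrary
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_window(protein, center, k=5):
    """One-hot encode the 2k+1 residues centred at 0-based index `center`.

    Returns a flat list of 20 * (2k + 1) zeros and ones. Positions outside
    the sequence are encoded as all zeros (an assumption of this sketch).
    """
    vector = []
    for pos in range(center - k, center + k + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(protein) and protein[pos] in INDEX:
            one_hot[INDEX[protein[pos]]] = 1
        vector.extend(one_hot)
    return vector


def candidate_windows(protein, k=5):
    """Yield (index, vector) only for positions holding lysine or arginine,
    since only those are considered as possible cleavage sites."""
    for pos, aa in enumerate(protein):
        if aa in "KR":
            yield pos, encode_window(protein, pos, k)
```
]]></preformat>
      <p>With k = 5, each vector has 20 × 11 = 220 binary entries, of which at most 11 are set to 1.</p>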
      <p>
        To train the SVM on a set of examples extracted from
proteins with known cleavage positions, we have used
the LIBSVM library
        <xref ref-type="bibr" rid="ref3">(Chang and Lin, 2011)</xref>
        . We have used a
window of size 11 (k = 5). This window size performed
well in NeuroPred as well as in our preliminary
experiments on various sequences. As a kernel function, we
have chosen the radial basis function. This kernel has a
small number of parameters, good precision and fast
convergence to optimum during training. The use of soft
margin classification and radial basis function requires setting
two hyper-parameters γ and C. We have set these
parameters by cross-validation: we have split the training data
into five groups. We have used four groups to train the
model with different values of hyper-parameters and used
the fifth group for evaluating the prediction accuracy of the
result. This was repeated five times, once for each choice of the
validation group, and the parameters were chosen based on
averaged performance.
      </p>
      <p>
        To evaluate the accuracy of our model, we have trained
it on 16 known neuropeptide precursors from the honey
bee Apis mellifera with a total of 70 cleavage sites used
previously to train NeuroPred
        <xref ref-type="bibr" rid="ref15 ref16">(Southey et al., 2008a)</xref>
        . As
testing data, we have used 21 precursors from the
fruit fly Drosophila melanogaster with 87 cleavage sites. This
data was described in the same paper. Our model
classified correctly 91.1% of testing examples, whereas
NeuroPred classified correctly 90.8%. Given this negligible
difference, we conclude that our SVM classifier can be used
instead of NeuroPred to predict cleavage sites in putative
neuropeptide precursors.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Semi-Markov Conditional Random Field for Neuropeptide Annotation</title>
      <p>In this section, we describe our system for neuropeptide
annotation based on a semi-CRF.</p>
      <p>
        Introduction to semi-CRFs. A conditional random field
(CRF) is a probabilistic model for annotating sequences
        <xref ref-type="bibr" rid="ref10">(Lafferty et al., 2001)</xref>
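        . Let <inline-formula><tex-math>X = (x_1, \dots, x_n)</tex-math></inline-formula> be the input
sequence and <inline-formula><tex-math>Y = (y_1, \dots, y_n)</tex-math></inline-formula> its annotation. A CRF
defines a conditional distribution over possible annotations
Y given sequence X, i.e., <inline-formula><tex-math>\Pr(Y \mid X)</tex-math></inline-formula>. To define this
distribution, it uses K local features <inline-formula><tex-math>g_1, \dots, g_K</tex-math></inline-formula>, which can be
tailored specifically to the studied problem.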
      </p>
      <p>
        We will use semi-CRFs, which are a slight generalization
of CRFs and differ in the exact definition of a local feature
        <xref ref-type="bibr" rid="ref12">(Sarawagi and Cohen, 2005)</xref>
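        . In a semi-CRF, the
annotation Y is expressed as a series of p segments <inline-formula><tex-math>(s_1, \dots, s_p)</tex-math></inline-formula>,
each segment using the same label. Segment <inline-formula><tex-math>s_i</tex-math></inline-formula> is given by
a triple <inline-formula><tex-math>(b_i, e_i, a_i)</tex-math></inline-formula>, where <inline-formula><tex-math>b_i</tex-math></inline-formula> is the start of the segment, <inline-formula><tex-math>e_i</tex-math></inline-formula> is
its end, and <inline-formula><tex-math>a_i</tex-math></inline-formula> is the annotation label. These values have to
satisfy the following constraints:
        <disp-formula><tex-math><![CDATA[
\begin{aligned}
& b_1 = 1, \\
& e_i \ge b_i, \\
& e_{i-1} + 1 = b_i, \\
& e_p = n, \\
& a_{i-1} \ne a_i, \\
& y_{b_i} = y_{b_i+1} = y_{b_i+2} = \dots = y_{e_i} = a_i.
\end{aligned}
]]></tex-math></disp-formula>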
      </p>
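      <p>The segment constraints can be made concrete with a small sketch that converts a per-residue annotation into triples (b, e, a) and checks the constraints directly. The function names are hypothetical and positions are 1-based and inclusive, as in the text.</p>
      <preformat><![CDATA[
```python
def labels_to_segments(labels):
    """Collapse a per-residue label sequence into (b, e, a) triples with
    1-based inclusive coordinates, merging runs of identical labels."""
    segments = []
    start = 1
    for pos in range(1, len(labels) + 1):
        # Close the current run at the last residue or where the label changes.
        if pos == len(labels) or labels[pos] != labels[pos - 1]:
            segments.append((start, pos, labels[pos - 1]))
            start = pos + 1
    return segments


def is_valid_segmentation(segments, n):
    """Check the constraints on (b, e, a) triples for a sequence of length n."""
    if not segments or segments[0][0] != 1 or segments[-1][1] != n:
        return False
    for i, (b, e, a) in enumerate(segments):
        if e < b:
            return False
        if i > 0:
            _, prev_e, prev_a = segments[i - 1]
            # Segments must be contiguous and adjacent labels must differ.
            if prev_e + 1 != b or prev_a == a:
                return False
    return True
```
]]></preformat>
      <p>For example, the annotation SSNNCP corresponds to four segments, the third of which is a cleavage site of length 1.</p>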
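      <p>Each local feature <inline-formula><tex-math>g_k</tex-math></inline-formula> has five inputs: the whole input
sequence X, the triple <inline-formula><tex-math>(b_i, e_i, a_i)</tex-math></inline-formula> describing a potential
segment in the annotation, and the label <inline-formula><tex-math>a_{i-1}</tex-math></inline-formula> of the previous
segment. The resulting number <inline-formula><tex-math>g_k(a_{i-1}, b_i, e_i, a_i, X)</tex-math></inline-formula>
is a score expressing how well the potential labels <inline-formula><tex-math>a_{i-1}, a_i</tex-math></inline-formula> fit the
information from sequence X at the positions specified by <inline-formula><tex-math>b_i</tex-math></inline-formula>
and <inline-formula><tex-math>e_i</tex-math></inline-formula>. A global feature <inline-formula><tex-math>G_k</tex-math></inline-formula> for a given annotation Y is
obtained as the sum of local features over all segments in Y:
        <disp-formula><label>(1)</label><tex-math><![CDATA[
G_k(X, Y) = \sum_{i=1}^{p} g_k(a_{i-1}, b_i, e_i, a_i, X).
]]></tex-math></disp-formula>
The conditional probability of annotation Y in a semi-CRF
depends on a weighted combination of global features:
        <disp-formula><label>(2)</label><tex-math><![CDATA[
\Pr(Y \mid X, w) = \frac{1}{Z_w(X)} \exp\Bigl( \sum_{k=1}^{K} w_k G_k(X, Y) \Bigr).
]]></tex-math></disp-formula>
Here, w represents the weights of the global features and <inline-formula><tex-math>Z_w(X)</tex-math></inline-formula>
is the normalization constant defined as
        <disp-formula><label>(3)</label><tex-math><![CDATA[
Z_w(X) = \sum_{Y'} \exp\Bigl( \sum_{k=1}^{K} w_k G_k(X, Y') \Bigr).
]]></tex-math></disp-formula>
The summation runs through all possible annotations Y′
of sequence X. The values of individual feature functions
can be arbitrary real numbers, because the normalization
constant <inline-formula><tex-math>Z_w(X)</tex-math></inline-formula> ensures that we obtain a proper probability
distribution over all annotations of a given sequence.</p>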
      <p>
        To create a semi-CRF, we manually choose a set of
local features, and then use training data to find an
optimal set of weights to maximize the conditional probability
Pr(Y|X, w). For inference, we are given the model,
including the weights, and a sequence X, and our goal is to
find the optimal annotation arg maxY Pr(Y|X, w). This can
be done by a dynamic programming algorithm similar to
the standard Viterbi algorithm for hidden Markov models
        <xref ref-type="bibr" rid="ref12">(Sarawagi and Cohen, 2005)</xref>
        . We have used an existing
implementation of semi-CRFs
        <xref ref-type="bibr" rid="ref13">(Sarawagi et al., 2014)</xref>
        for
both training and inference in our model.
      </p>
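      <p>The inference step can be sketched as a generic semi-Markov variant of the Viterbi algorithm. This is not the implementation used in this work: the function name, the <monospace>score</monospace> callback (assumed to return the weighted sum of local features for one segment), and the <monospace>max_len</monospace> bound on segment length are hypothetical choices for the sketch. Since the normalization constant does not depend on the annotation, maximizing the summed segment scores yields the most probable annotation.</p>
      <preformat><![CDATA[
```python
def semi_markov_viterbi(n, labels, score, max_len):
    """Return (best_score, segments) maximising the sum of
    score(prev_label, b, e, a) over segmentations of positions 1..n.

    Segments are (b, e, a) triples with 1-based inclusive coordinates;
    prev_label is None for the first segment; adjacent labels must differ.
    """
    neg_inf = float("-inf")
    # best[e][a]: best score of a segmentation of 1..e whose last label is a.
    best = [{a: neg_inf for a in labels} for _ in range(n + 1)]
    back = [{a: None for a in labels} for _ in range(n + 1)]
    for e in range(1, n + 1):
        for a in labels:
            for b in range(max(1, e - max_len + 1), e + 1):
                if b == 1:
                    candidates = [(None, 0.0)]
                else:
                    candidates = [(p, best[b - 1][p]) for p in labels if p != a]
                for prev, prev_score in candidates:
                    if prev_score == neg_inf:
                        continue
                    total = prev_score + score(prev, b, e, a)
                    if total > best[e][a]:
                        best[e][a] = total
                        back[e][a] = (b, prev)
    last = max(labels, key=lambda a: best[n][a])
    if best[n][last] == neg_inf:
        raise ValueError("no segmentation satisfies the length bound")
    # Follow back-pointers from the end of the sequence to its start.
    segments, e, a = [], n, last
    while e > 0:
        b, prev = back[e][a]
        segments.append((b, e, a))
        e, a = b - 1, prev
    return best[n][last], segments[::-1]
```
]]></preformat>
      <p>The running time of this sketch grows with n · max_len · |Σ|², which is why bounding the segment length matters in practice.</p>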
      <p>Topology of our model. Our model uses the set of
annotation labels Σ described above. However, there are
restrictions on possible sequences of labels in a permissible
annotation, which are depicted in the state diagram shown in
Figure 3. Every sequence starts with a signal peptide (we
exclude proteins not containing this peptide from our
analysis, as the presence of the peptide can be detected using
the SignalP software). The model then has two branches,
the bottom one corresponding to neuropeptide precursors
and the top one to other proteins, which may or may not be
cleaved. The cleavage sites C and C̄ are always segments
of length 1; other segments can be longer.</p>
      <p>Feature functions of the model. Our model uses several
groups of feature functions described below. While
designing these feature functions, we took into account the
relatively small size of the available training sets as well as
limitations of the semi-CRF implementation used.</p>
      <p>• Model topology. The model topology described above is
enforced only indirectly, by the following set
of feature functions. For each pair of labels u and v,
we create a feature function indicating whether this
pair occurs in two adjacent segments of a potential
segmentation. Here a′ denotes the label of the preceding segment.
        <disp-formula><label>(4)</label><tex-math><![CDATA[
g_{\mathrm{edge}(u,v)}(a', b, e, a, X) :=
\begin{cases}
1 & \text{if } a' = u \wedge a = v,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
      </p>
      <p>Such a feature function is created for all pairs of
labels, even those that should not follow each other in
our state diagram. However, such forbidden pairs
never occur in our training data, and as a result, their
weight should be negative (in theory −∞). In
contrast, pairs of labels that follow each other frequently
in our training set should get a positive weight from the
training procedure.</p>
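      <p>To control the initial and final labels of a sequence, we
also create the following two feature functions for
each label v:
        <disp-formula><label>(5)</label><tex-math><![CDATA[
g_{\mathrm{begin}_v}(a', b, e, a, X) :=
\begin{cases}
1 & \text{if } a = v \wedge b = 1,\\
0 & \text{otherwise,}
\end{cases}
]]></tex-math></disp-formula>
        <disp-formula><label>(6)</label><tex-math><![CDATA[
g_{\mathrm{end}_v}(a', b, e, a, X) :=
\begin{cases}
1 & \text{if } a = v \wedge e = |X|,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
      </p>
      <p>• Signal peptide information. During preprocessing,
we run SignalP on our input sequence, obtaining for
each position a probability that this position is the end
of the signal peptide. Let us denote this sequence of
SignalP scores <inline-formula><tex-math>\sigma_1, \dots, \sigma_n</tex-math></inline-formula>. Ideally, we would have a
high value of this signal at the last position of the
initial segment labeled S. This is captured by the
following feature function.</p>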
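      <p>
        <disp-formula><label>(7)</label><tex-math><![CDATA[
g_{\mathrm{signal}}(a', b, e, a, X) :=
\begin{cases}
\sigma_e & \text{if } b = 1 \wedge a = S,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
      </p>
      <p>• Cleavage site prediction. Similarly, we run our SVM
model to obtain, for each position of the
sequence, a probability that this position will be cleaved. Let us
denote the sequence of these scores <inline-formula><tex-math>\alpha_1, \dots, \alpha_n</tex-math></inline-formula>. Cleaved
positions should be annotated by labels C or C̄. To
capture this information, we use the following two
feature functions:
        <disp-formula><label>(8)</label><tex-math><![CDATA[
g_{\mathrm{cl}_v}(a', b, e, a, X) :=
\begin{cases}
\alpha_e & \text{if } a = v \wedge b = e,\\
0 & \text{otherwise,}
\end{cases}
]]></tex-math></disp-formula>
where v ∈ {C, C̄}. The weight assigned to this feature
by training corresponds to the importance of
the cleavage site prediction for predicting the correct
annotation.</p>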
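      <p>For every other label type, we want small values of
the cleavage probability in the entire segment,
because a high probability of cleavage may represent
a cleavage site missed by the predicted annotation.
Therefore, for each label type v ∉ {C, C̄} we add a
feature function which considers the average squared
value of the cleavage signal inside a segment with this
label.
        <disp-formula><label>(9)</label><tex-math><![CDATA[
g_{\neg\mathrm{cl}_v}(a', b, e, a, X) :=
\begin{cases}
\dfrac{1}{e - b + 1} \sum_{i=b}^{e} \alpha_i^2 & \text{if } a = v,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
Ideally, this feature should get a negative weight in
training, because we want cleavage signals inside
segments to be weak.</p>
      <p>• Segment length. Feature functions in a semi-CRF
allow us to score segment lengths. Ideally, we would
use training data to estimate the length distribution of
different segment types and use probabilities from these
distributions as feature values. Due to limited
training data, we have decided to use simple features that
classify each length as short, normal, or long as
follows.
        <disp-formula><label>(10)</label><tex-math><![CDATA[
g_{\mathrm{short}_v}(a', b, e, a, X) :=
\begin{cases}
1 & \text{if } a = v \wedge e - b + 1 < \mathit{min}_v,\\
0 & \text{otherwise,}
\end{cases}
]]></tex-math></disp-formula>
        <disp-formula><label>(11)</label><tex-math><![CDATA[
g_{\mathrm{norm}_v}(a', b, e, a, X) :=
\begin{cases}
1 & \text{if } a = v \wedge \mathit{min}_v \le e - b + 1 \le \mathit{max}_v,\\
0 & \text{otherwise,}
\end{cases}
]]></tex-math></disp-formula>
        <disp-formula><label>(12)</label><tex-math><![CDATA[
g_{\mathrm{long}_v}(a', b, e, a, X) :=
\begin{cases}
1 & \text{if } a = v \wedge e - b + 1 > \mathit{max}_v,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
      </p>
      <p>
        We have set the thresholds <inline-formula><tex-math>\mathit{min}_v</tex-math></inline-formula> and <inline-formula><tex-math>\mathit{max}_v</tex-math></inline-formula> manually.
According to
        <xref ref-type="bibr" rid="ref4">Clynen et al. (2010)</xref>
        , 98% of known
neuropeptides have length between 3 and 60. We use
stricter thresholds <inline-formula><tex-math>\mathit{min}_N = 5</tex-math></inline-formula> and <inline-formula><tex-math>\mathit{max}_N = 60</tex-math></inline-formula>. Shorter or
longer neuropeptides are also permissible, but may
incur a penalty, depending on the weights assigned by
training to the <inline-formula><tex-math>g_{\mathrm{short}_N}</tex-math></inline-formula> and <inline-formula><tex-math>g_{\mathrm{long}_N}</tex-math></inline-formula> features. The same
threshold values were also used for other segment
types.
      </p>
      <p>
        • Statistical properties of the sequence. Finally, we use
a set of attributes describing typical frequencies of
amino acids in different segment types. In
particular,
        <xref ref-type="bibr" rid="ref11">Ofer and Linial (2014)</xref>
        observe that neuropeptides
are characterized by increased frequency of aromatic
amino acids (F, W, and Y). We capture this
information in the following feature function.
        <disp-formula><label>(13)</label><tex-math><![CDATA[
g_{\mathrm{aroma}_v}(a', b, e, a, X) :=
\begin{cases}
\dfrac{1}{e - b + 1} \sum_{i=b}^{e} [\, x_i \text{ is aromatic} \,] & \text{if } a = v,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
These features turned out to be useful, because the
weight of function <inline-formula><tex-math>g_{\mathrm{aroma}_N}</tex-math></inline-formula> was always a relatively
high positive number.</p>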
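      <p>The segment-level quantities behind the cleavage-signal, length, and aromatic features can be sketched as plain functions of the segment boundaries. The function names are hypothetical, coordinates are 1-based and inclusive, and the default thresholds are the values 5 and 60 stated in the text. A full feature would additionally return 0 whenever the segment label differs from v.</p>
      <preformat><![CDATA[
```python
AROMATIC = set("FWY")


def mean_squared_cleavage(alpha, b, e):
    """Average of alpha_i squared over positions b..e; alpha holds the
    per-position cleavage scores as a 0-based list."""
    return sum(alpha[i - 1] ** 2 for i in range(b, e + 1)) / (e - b + 1)


def length_class(b, e, min_len=5, max_len=60):
    """Classify the segment length e - b + 1 as 'short', 'norm' or 'long'."""
    length = e - b + 1
    if length < min_len:
        return "short"
    if length > max_len:
        return "long"
    return "norm"


def aromatic_fraction(x, b, e):
    """Fraction of aromatic residues (F, W, Y) among positions b..e of x."""
    return sum(x[i - 1] in AROMATIC for i in range(b, e + 1)) / (e - b + 1)


def gated(value, a, v):
    """Turn a segment statistic into a feature for label v."""
    return value if a == v else 0.0
```
]]></preformat>
      <p>Averaging over the segment keeps these features on a comparable scale for short and long segments, so a single weight per label can be learned.</p>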
      <p>To consider frequencies of other amino acids and
their groups, we have also included features of the
following form:
        <disp-formula><label>(14)</label><tex-math><![CDATA[
g_{\mathrm{emit}_v}(a', b, e, a, X) :=
\begin{cases}
\sum_{i=b}^{e} \log \Pr(x_i \mid v) & \text{if } a = v,\\
0 & \text{otherwise.}
\end{cases}
]]></tex-math></disp-formula>
Here, Pr(x|v) is the frequency of amino acid x in
features labeled as v in the training data. This feature
function corresponds to the log-likelihood of a given
segment in an i.i.d. model in which each amino acid
is drawn from the distribution Pr(x|v).</p>
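      <p>A minimal sketch of this emission feature follows. It estimates the per-label amino acid frequencies from labeled training pairs and sums their logarithms over a segment. The function names and the pseudocount smoothing are assumptions of the sketch; the text does not say how zero counts are handled.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def estimate_frequencies(training, pseudocount=1.0):
    """Estimate Pr(x | v) from (sequence, labels) pairs of equal length.

    Returns {label: {amino_acid: probability}}. The pseudocount avoids
    log(0) for unseen residues (an assumption of this sketch).
    """
    counts = {}
    for sequence, labels in training:
        for aa, label in zip(sequence, labels):
            counts.setdefault(label, Counter())[aa] += 1
    freqs = {}
    for label, counter in counts.items():
        total = sum(counter[aa] for aa in ALPHABET) + pseudocount * len(ALPHABET)
        freqs[label] = {aa: (counter[aa] + pseudocount) / total for aa in ALPHABET}
    return freqs


def g_emit(freqs, v, b, e, a, x):
    """Log-likelihood of x[b..e] (1-based, inclusive) under label v,
    or 0 if the segment label a differs from v."""
    if a != v:
        return 0.0
    return sum(math.log(freqs[v][x[i - 1]]) for i in range(b, e + 1))
```
]]></preformat>
      <p>The grouped variant of the feature can reuse the same code after mapping each amino acid to a symbol for its biochemical group.</p>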
      <p>We also use a similar feature function, in which we
group together amino acids with similar biochemical
properties (aromatic, basic, acidic, etc.). This leads to
the following attributes:
        <disp-formula><label>(15)</label><tex-math><![CDATA[
g_{\mathrm{emitg}_v}(a', b, e, a, X) :=
\begin{cases}
\sum_{i=b}^{e} \log \Pr(\mathrm{gr}(x_i) \mid v) & \text{if } a = v,\\
0 & \text{otherwise,}
\end{cases}
]]></tex-math></disp-formula>
where gr(x) represents the biochemical group of
amino acid x, and Pr(gr(x)|v) is the probability that
one of the amino acids from this group appears in a
segment annotated as v. These probabilities are also
estimated from the training data.</p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>
        Training and testing data. To train and test our model,
we have created a data set of neuropeptide precursors
and other proteins representing negative examples. To do
so, we have selected curated proteins from the UniProt
database
        <xref ref-type="bibr" rid="ref17">(UniProt Consortium, 2008; Uniprot, 2014)</xref>
        . This
database contains annotated protein sequences created
either manually based on results from scientific literature or
by automated methods. We have created two data sets.
The first contains only sequences from Arthropoda and the
second from all animals (Metazoa). Proteins were selected
using queries shown in Figure 4.
      </p>
      <p>[Figure 4: UniProt queries used to select positive examples (A: Arthropoda, B: Metazoa) and negative examples (C: Arthropoda, D: Metazoa).]</p>
      <preformat>A: keyword:"Neuropeptide [KW-0527]"
   AND taxonomy:"Arthropoda [6656]"
   NOT keyword:"Receptor [KW-0675]"
B: keyword:"Neuropeptide [KW-0527]"
   AND taxonomy:"Metazoa [33208]"
   NOT keyword:"Receptor [KW-0675]"
C: NOT keyword:"Neuropeptide [KW-0527]"
   AND taxonomy:"Arthropoda [6656]"
   AND keyword:"Reference proteome [1185]"
   AND reviewed:yes
D: NOT keyword:"Neuropeptide [KW-0527]"
   AND taxonomy:"Metazoa [33208]"
   AND keyword:"Reference proteome [1185]"
   AND reviewed:yes</preformat>
      <p>
        All obtained proteins were clustered based on sequence
similarity using the CD-HIT tool
        <xref ref-type="bibr" rid="ref8">(Huang et al., 2010)</xref>
        , so that
each cluster contains proteins with 90% sequence identity. We
then selected the best-annotated representative from each
cluster. We further filtered out sequences with
incomplete or ambiguous annotations. This clustering and
filtering process reduced the size of the Arthropoda data
set from 1340 original proteins to 76 positive examples
and for Metazoa from 2029 to 175 positive examples. A
similar process for negative examples yielded only 35
sequences in Arthropoda and 255 in Metazoa.
      </p>
      <p>To run our experiments, we have used five-fold
cross-validation. We have divided the set into five parts. One of
these parts was used for testing and the remaining four for
training. This process was repeated five times and the
results were averaged. The SVM for cleavage site prediction
was trained on the same training data as the CRF.</p>
      <p>Annotation accuracy. Table 1 shows the annotation
accuracy of different labels in four versions of our model. Two
versions used only Arthropoda sequences for training and
testing and two used a larger but more heterogeneous set
of Metazoa proteins. For each set, we have also
considered two models. One, denoted by subscript p, used only
positive examples, and restricted the model to labels S, C,
N, and P. The second one used the full model and both
positive and negative examples.</p>
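      <p>To evaluate prediction accuracy, we consider each label
separately and count the number of amino acids that were
correctly or incorrectly labeled by this label, obtaining the
counts of true positives (TP), false positives (FP), and false
negatives (FN). We then calculate the F1 measure, which
is a combination (the harmonic mean) of precision and recall:
        <disp-formula><tex-math><![CDATA[
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
]]></tex-math></disp-formula>
      </p>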
      <p>Here, precision (also known as positive predictive value) is
the fraction of predicted occurrences of the label that are
indeed correct, i.e. TP/(TP + FP). Recall (also known as
sensitivity) is the fraction of real occurrences of the label
that were correctly predicted, i.e. TP/(TP + FN).</p>
      <p>In addition, we also compute the overall accuracy of the
prediction as the fraction of correctly labeled amino acids,
considering all labels together.</p>
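      <p>The per-label evaluation described above can be sketched as follows. The function names are hypothetical, the F1 measure is computed with the standard harmonic-mean formula, and returning 0 when a denominator is zero is a convention chosen for this sketch.</p>
      <preformat><![CDATA[
```python
def label_metrics(true_labels, predicted_labels, label):
    """Return (precision, recall, F1) for one label, counted per amino acid."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of amino acids whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```
]]></preformat>
      <p>For instance, if the true annotation is SSNNCP and the prediction is SSNNNP, label N has precision 2/3 and recall 1, and five of six positions are labeled correctly.</p>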
      <p>In most measures, the models restricted to Arthropoda
are more accurate than the Metazoa models. This suggests
that the statistical properties of neuropeptide precursors in
vertebrates and invertebrates are sufficiently different to
offset the advantage of using a larger combined training
set.</p>
      <p>As expected, we achieve a higher overall accuracy on the simpler
problem without negative examples. In the general problem
with negative examples, the prediction accuracy for label C
is greater than for label C¯. This might be caused by the fact
that the cleavage site prediction model was trained only on
neuropeptide precursors. Thus, we could potentially
improve the prediction accuracy by training SVM models for
neuropeptide precursors and other proteins separately.</p>
      <p>To further quantify the effect of incorrect predictions by
the SVM and SignalP on the accuracy, we have created a
version of the Arthropoda model which replaces the scores
from these two predictors by the correct values (value 1 for
cleavage sites, value 0 for other sites). These values were
then used for both training and testing. This modification
increased the overall accuracy from 70.4% to 79.1%. This
represents an upper bound of improvements achievable by
changes in the SignalP and SVM predictors.
</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We have designed and implemented a system for
annotation of neuropeptides. In the future, we hope to use this
system to seek novel neuropeptides in various organisms and
to evaluate such predictions in collaboration with life
scientists.</p>
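      <p>Acknowledgments. This research was supported by VEGA
grants 1/1085/12 and 1/0719/14 and by the grant of the
Slovak Academy of Sciences MVTS GAMMA.</p>
      <p>[Table 1: Annotation accuracy of different labels in four versions of the model. Surviving entries: 98.1%, 95.9%; label P: 78.6%, 71.5%.]</p>
      <p>Uniprot release 2014-01.</p>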
      <p>UniProt Consortium (2008). The universal protein
resource (UniProt). Nucleic Acids Research, 36(suppl
1):D190–D195.</p>
      <p>Yandell, M. and Ence, D. (2012). A beginner’s guide to
eukaryotic genome annotation. Nature Reviews
Genetics, 13(5):329–342.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bendtsen</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          ,
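          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>von Heijne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and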
          <string-name>
            <surname>Brunak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Improved prediction of signal peptides: SignalP 3.0</article-title>
          .
          <source>Journal of Molecular Biology</source>
          ,
          <volume>340</volume>
          (
          <issue>4</issue>
          ):
          <fpage>783</fpage>
          -
          <lpage>795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Burge</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Karlin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Prediction of complete gene structures in human genomic DNA</article-title>
          .
          <source>Journal of Molecular Biology</source>
          ,
          <volume>268</volume>
          (
          <issue>1</issue>
          ):
          <fpage>78</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.-C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.-J.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>LIBSVM: a library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):
          <fpage>27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Clynen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Husson</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landuyt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayakawa</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baggerman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wets</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schoofs</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Bioinformatic approaches to the identification of novel neuropeptide precursors</article-title>
          .
          In
          <source>Peptidomics</source>
          , volume
          <volume>615</volume>
          of Methods in Molecular Biology, pages
          <fpage>357</fpage>
          -
          <lpage>364</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ):
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>DeCaprio</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinson</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pearson</surname>
            ,
            <given-names>M. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montgomery</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doherty</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Galagan</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Conrad: gene prediction using conditional random fields</article-title>
          .
          <source>Genome Research</source>
          ,
          <volume>17</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1389</fpage>
          -
          <lpage>1398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hook</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funkelstein</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wegrzyn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hwang</surname>
            ,
            <given-names>S.-R.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Proteases for processing proneuropeptides into peptide neurotransmitters and hormones</article-title>
          .
          <source>Annual Review of Pharmacology and Toxicology</source>
          ,
          <volume>48</volume>
          :
          <fpage>393</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>CD-HIT Suite: a web server for clustering and comparing biological sequences</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>26</volume>
          (
          <issue>5</issue>
          ):
          <fpage>680</fpage>
          -
          <lpage>682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Krogh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larsson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>von Heijne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sonnhammer</surname>
            ,
            <given-names>E. L.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes</article-title>
          .
          <source>Journal of Molecular Biology</source>
          ,
          <volume>305</volume>
          (
          <issue>3</issue>
          ):
          <fpage>567</fpage>
          -
          <lpage>580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F. C. N.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          In
          <source>Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001)</source>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Ofer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Linial</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>NeuroPID: a predictor for identifying neuropeptide precursors from metazoan proteomes</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>30</volume>
          (
          <issue>7</issue>
          ):
          <fpage>931</fpage>
          -
          <lpage>940</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W. W.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Semi-Markov conditional random fields for information extraction</article-title>
          .
          In
          <source>Advances in Neural Information Processing Systems (NIPS 2004)</source>
          , pages
          <fpage>1185</fpage>
          -
          <lpage>1192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          et al. (
          <year>2014</year>
          ). http://crf.sourceforge.net/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Southey</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amare</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmerman</surname>
            ,
            <given-names>T. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez-Zas</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sweedler</surname>
            ,
            <given-names>J. V.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>34</volume>
          (W):
          <fpage>W267</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Southey</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sweedler</surname>
            ,
            <given-names>J. V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rodriguez-Zas</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          (
          <year>2008a</year>
          ).
          <article-title>Prediction of neuropeptide cleavage sites in insects</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>6</issue>
          ):
          <fpage>815</fpage>
          -
          <lpage>825</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Southey</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sweedler</surname>
            ,
            <given-names>J. V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rodriguez-Zas</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          (
          <year>2008b</year>
          ).
          <article-title>A Python analytical pipeline to identify prohormone precursors and predict prohormone cleavage sites</article-title>
          .
          <source>Frontiers in Neuroinformatics</source>
          ,
          <volume>2</volume>
          :
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>UniProt</surname>
          </string-name>
          (
          <year>2014</year>
          ). http://www.uniprot.org/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>