=Paper= {{Paper |id=None |storemode=property |title=Gene finding with complex external information |pdfUrl=https://ceur-ws.org/Vol-788/paper6.pdf |volume=Vol-788 |dblpUrl=https://dblp.org/rec/conf/itat/KucharikKB11 }} ==Gene finding with complex external information== https://ceur-ws.org/Vol-788/paper6.pdf
                Gene finding with complex external information?

                                Marcel Kucharı́k, Jakub Kováč, and Broňa Brejová

                 Department of Computer Science, Faculty of Mathematics, Physics, and Informatics
                        Comenius University, Mlynská Dolina, 842 48 Bratislava, Slovakia

Abstract. The goal of gene finding is to locate genes,         account statistical properties of genic and non-genic
which are important segments of DNA encoding proteins.         regions of DNA, such as frequency of k-tuples, length
Programs solving this task are based on hidden Markov          distribution of genes, and characteristic sequence mo-
models (HMMs) capturing statistical features extracted         tifs at gene boundaries. Genes found by HMMs are of-
from known genes, but often also incorporate hints about       ten imprecise, and to increase the accuracy, statistical
the correct gene structure extracted from experimental
                                                               models of DNA are often combined with additional ex-
data. Existing gene finding programs can use such external
information only in a limited way. Typically, they can pro-
                                                               ternal information provided by biological experiments.
cess only simple hints describing a single part of the gene    External information is often in the form of individual
structure, because these are relatively easy to incorporate    hints, each hint giving a probable location of one gene
to standard HMM algorithms, but cannot cope with com-          or a part of a gene.
plex hints spanning multiple parts. We have developed an            To combine an HMM with hints, we can manipu-
efficient algorithm able to process such complex hints. Our    late probabilities of individual labelings A so that the
experiments show that this approach slightly increases the     labeling gets a bonus for each hint agreeing with it.
accuracy of gene prediction. We also prove that a more         Different gene finders define the set of possible hints
general class of hints leads to an NP-hard problem.            and their influence on the probability space differently,
                                                               but the possibilities are restricted by the need for ef-
                                                               ficient algorithm finding the annotation with highest
1     Introduction                                             probability in the modified space. The simplest form
                                                               of a hint is a point-wise hint which increases or de-
In this paper we study a combinatorial optimization            creases probabilities of all annotations that have a cer-
problem arising in computational biology. Although             tain label at a certain position in the sequence [2, 16].
we were originally motivated by a very specific appli-         Such hints can be easily incorporated to the standard
cation, we formulate the problem quite generally and           Viterbi algorithm for finding the most probable anno-
explore its variants that admit efficient polynomial-          tations in HMMs.
time algorithms as well as those that are NP hard.
                                                                    However, a single piece of evidence often suggests
    Our motivation comes from the problem of gene              labels for a longer interval of the sequence. Splitting
finding. Here the goal is to locate segments of a DNA          such information into multiple point-wise hints leads
sequence that encode proteins produced by the organ-           to information loss, because some labelings can get the
ism. On a more abstract level, we are given a string           bonus even if they do not agree with the evidence over
X = x1 . . . xn and the goal is to produce a string            the whole interval. Therefore, some gene finders [14, 5]
A = a1 . . . an over some output alphabet Σ of possible        use hints in the form of intervals such that the proba-
labels, such that ai is a label of xi , corresponding to its   bility of a labeling is increased only if it has the label
functional role. In gene finding, X is the input DNA           prescribed by the hint throughout the whole extent
sequence (over alphabet {A, C, G, T }) and the output          of the interval. We call such hints interval hints. The
alphabet could be Σ = {0, 1} where 1 stands for genes          main focus of our paper is to study further generaliza-
and 0 for non-genic regions. We will call A a labeling         tions of interval hints.
or annotation of the input string X.
                                                                    For most of the time we will abstract away both
    The desired mapping from X to A is in gene finding
                                                               the HMM and the input string X and consider only
often characterized by a probabilistic model defining
                                                               a set of hints. Indeed, probabilities from the HMM can
probability distribution P (A|X). We then output the
                                                               be expressed as special hints of length 2, as we explain
labeling with the highest probability. Frequently used
                                                               in Section 2. Therefore in the most general setting, we
models are hidden Markov models (HMMs) or their
                                                               can formulate our problem as follows:
variants such as hidden semi-Markov models and con-
ditional random fields [3, 5]. These models take into
                                                      Definition 1 (Optimal labeling with hints prob-
?
    This research is funded by European Community FP7 lem). Given is positive integer n, finite alphabet Σ,
    grant IRG-231025, VEGA grant 1/0210/10, and APVV and a set of hints. Each hint is a pair (γ, b) where
    grant SK-CN-0007-09.                              γ ⊆ Σ n is a set of labelings and b ∈ R is a bonus. We
40       Marcel Kucharı́k, Jakub Kováč, Broňa Brejová

say that a labeling A ∈ Σ n agrees with hint (γ, b) if      form with running time O(n2 |Σ|2 ). With appropriate
A ∈ γ. The score of a labeling A ∈ Σ n is the sum of        data structures, it is possible to implement the algo-
bonuses of all hints that agree with A. The goal is to      rithm so that processing of hints adds only an additive
find a labeling with the maximum score.                     term O(m) where m is the number of hints. However,
                                                            a more efficient algorithm would be desirable if we
     In our work the hints will assume a special form wanted to use hints with regular HMMs, where the
allowing compact representation of a potentially large Viterbi algorithm is linear in n. Running time of our
set γ. In particular, a complex hint is a four-tuple algorithm for complex hints with positive bonuses will
(s, e, Y, b) such that 1 ≤ s ≤ e ≤ n and Y ∈ Σ e−s+1 not depend on the sequence length, only on the num-
is the labeling suggested for xs . . . xe (see Figure 1). ber and total length of hints.
Set γ for this hint is γ = Σ s−1 Y Σ n−e . Interval hints
are a special case of complex hints with Y = ak for
some a ∈ Σ and point-wise hints are a special case of Complex hints with positive bonuses. In order to in-
interval hints with s = e.                                  troduce our algorithm for a set of complex hints with
                                                            positive bonuses, we start with several necessary def-
                                                            initions. Let h = (s, e, y1 . . . ys−e+1 , b) be a complex
                h1 = (1, 4, 0001, 2)    0001...             hint. Then for i ∈ [s, e] expression `(h, i) denotes la-
                h2 = (2, 3, 00, 1)      .00....             bel suggested by h for xi , that is, `(h, i) = yi−s+1 .
                h3 = (2, 6, 00100, 1)   .00100.             Now let us consider two hints h1 = (s1 , e1 , Y1 , b1 ) and
                h4 = (3, 6, 0001, 1)    ..0001.             h2 = (s2 , e2 , Y2 , b2 ). We say that h1 and h2 are com-
                h5 = (6, 7, 02, 1)      .....02
                                                            patible if they agree in the intersection of their inter-
                A                       0001002
                                                            vals, that is, if for every i ∈ [s1 , e1 ] ∩ [s2 , e2 ] we have
Fig. 1. A set of complex hints for n = 7 and Σ = {0, 1, 2} `(h1 , i) = `(h2 , i) (non-intersecting hints are always
and an optimal labeling with score 5, agreeing with hints compatible). Hint h1 is a subset of hint h2 if they are
h1 , h2 , h3 and h5 . Note that h2 is an interval hint.     compatible and [s1 , e1 ] ⊆ [s2 , e2 ]. Hint h2 is an ex-
                                                            tension of h1 if they are compatible and e2 > e1 and
     In this work we describe an efficient algorithm for s2 > s1 . Note that if s1 = s2 , e1 = e2 , and Y1 = Y2 ,
finding the optimal labeling for a set of complex hints, these two hints can be replaced by a single hint with
which is a non-trivial extension of algorithms for in- bonus b1 + b2 . We will assume that such a transforma-
terval hints (Section 2). We also show that additional tion is done on all applicable pairs.
natural generalization of the problem where some po-             In our algorithm, we transform the input to create
sitions between s and e are allowed to vary, leads to a weighted directed acyclic graph (DAG) such that the
an NP-hard problem (Section 3). Finally in Section 4, optimal labeling corresponds to the path with maxi-
we describe the use of our algorithm in the context mum weight between given two vertices s and t. Such
of gene finding and show experimental results on real longest paths can be found in DAGs efficiently, in
and simulated data.                                         O(|V | + |E|) time [4]. Our algorithm could be also
                                                            expressed directly as dynamic programming, but the
2 An algorithm for complex hints                            graph representation is more convenient for proving
                                                            correctness and considering variants of the algorithm.
In this section we give a polynomial-time algorithm              Our graph has one vertex for each hint and two spe-
which finds the optimal labeling for a set of complex       cial  vertices s and t. There is an edge from h1 to h2 if
hints. We will first discuss several simpler cases, later   and   only if h2 is an extension of h1 . The weight of this
generalizing the algorithm to the full problem.             edge   is the sum of the bonus of h2 and bonuses of all
                                                            hints that are subset of h2 but are not a subset of h1
Interval hints. Gene finders employing interval hints (see Figure 2). Vertices s and t are connected with
typically proceed by dynamic programming. For each other vertices as if they corresponded to sentinel inter-
prefix x1 . . . xi of the input string X, they consider all vals beyond the end of the sequence, namely (0, 0, a, 0)
possibilities for the last region of this prefix labeled by and (n + 1, n + 1, a, 0). Clearly the graph created in
one label (that is, all possible j ≤ i and a ∈ Σ such this way is acyclic, because vertices on each path have
that aj = aj+1 = · · · = ai = a and aj−1 6= a). For increasing coordinates of their endpoints. Correctness
each such interval (j, i), it is easy to compute the sum of the algorithm is given by the following theorem.
of bonuses of all interval hints agreeing with label a.
                                                                                              ∗
     This type of algorithm is dictated by the fact that Theorem 1. The weight W of the maximum-weight
gene finders combine hints with hidden semi-Markov path from s to t in the graph described above is equal
                                                                             ∗
models or their variants [14, 5] and the Viterbi algo- to the score S of an optimal labeling for the input set
rithm for inference in these models already has this of hints.
                                                                Gene finding with complex external information         41


          3                                          0
                                                                   Our graph has O(m) vertices and O(m2 ) edges
                0    0   0   1                                 where m is the number of hints. As we will show be-
          1
                     0   0
                                                     0         low, it can be constructed in O(m2 + `) time, where `
                             1
          2                                          0         is the sum of lengths of all hints. The running time of
      s              0   0   1       0   0               t
          1
                                 1
                                                     0
                                                               the whole algorithm is therefore O(m2 + ` + n) where
                         0   0       0   1                     the dependence on n comes only from the necessity to
                                                 1
          1                                          0         produce a labeling of length n (using arbitrary labels
                                         0   2
                                                               for parts of the sequence not covered by any hint).
Fig. 2. Directed acyclic graph created by our algorithm for
hint set from Figure 3. Vertices for hints are shown as grey
boxes.                                                         Running time improvement for practical instances. In
                                                               practical gene finding instances, we expect hints to
Proof. First, we will prove that for every labeling A          be much shorter than n and to be spread along the
with score S there is a path with weight S and there-          length of the sequence. We can modify the graph so
fore W ∗ ≥ S ∗ . Let H be the set of hints that agree          that it has fewer edges if every position is covered by
with A. Let HS be the set of all hints from H that are         at most d hints for some d < m. The construction
a subset of another hint from H, and let HR = H\HS .           shown above will create edges even between hints that
Let h1 and h2 be two different hints from HR such that         are far apart, and those can be eliminated. The weight
s1 < s2 . Then also e1 < e2 because no hint in HR is a         of an edge from hint h0 to hint h depends on mutual
subset of another, and these two hints are compatible          position of these hints, because we include only those
because they both agree with A. Therefore, h2 is an            subset hints of h that end after the end of h0 . However,
extension of h1 .                                              the weight will be the same for all hints h0 that do not
    Our path will start in s, pass through all vertices        intersect h. Our goal is to remove such edges from
in set HR ordered from left to right by their left end-        the graph and replace them with a smaller number of
points and end in t. As we have shown, all necessary           edges linear in m.
edges on the path exist. Bonus of hint h ∈ HR is in-
cluded in the weight of the edge entering h and bonus               We will say that a hint h2 is a strict extension of
of hint h ∈ HS is included in the weight of the edge en-       a hint h1 if h1 is an extension of h2 and the two hints
tering the leftmost hint h0 ∈ HR such that h is a sub-         overlap. Our new graph will have a vertex for each
set of h0 . Bonuses of no other hints are included in          hint, two special vertices s and t for sentinel hints
the weight of the path, and therefore its weight is ex-        and a skipping vertex vi for every position i which
actly S.                                                       is a right endpoint of at least one hint (including po-
    Now we will prove that for any path π from s               sition 0 where the sentinel hint for s ends). Skipping
to t with weight W , there is a labeling with score at         vertices are connected to a chain going from left to
least W , implying that W ∗ ≤ S ∗ . Let H0 be the set of       right with edges of weight 0. Vertex for a hint h is
all hints corresponding to vertices of π except s and t.       connected to all hints that are its strict extensions,
We will construct a labeling A = a1 . . . an as follows.       with the same edge weights as before. It is also con-
If a position i is not covered by any hint in H0 , its         nected to the skipping vertex for its right endpoint
label ai can be chosen arbitrarily. If i is covered by         with an edge of weight 0. Finally, each skipping ver-
some hints h1 , . . . , hk from H0 , they suggest the same     tex vi is connected to all hints that start in the interval
label for position i and this label is chosen as ai .          (i, j] where vj is its neighbor in the chain of skipping
    Now let H00 be the set containing all hints from H0        vertices. The edge from vi to h will contain the bonus
and all hints that are subsets of hints from H0 . The          of h and all its subset bonuses. Paths in this new graph
score S of path π is the sum of bonuses of hints               have 1-1 correspondence with the paths in the origi-
from H00 . Also, each hint from this set agrees with A.        nal graph, where an edge between two non-overlapping
The score of A is the sum of all bonuses that agree            hints is now replaced by a path through skipping ver-
with it, therefore S ≤ W .                               t
                                                         u     tices. The number of vertices is O(m) and the number
                                                               of edges O(md), since from a vertex for hint h there
    We have proved that the weight of the optimal path         are at most d outgoing edges: one to a skipping ver-
is equal to the score of the optimal labeling and the          tex and at most d − 1 to other hints, because all these
proof also implies an algorithm to convert the optimal         hints cover the right endpoint of h. Similarly the num-
path to a labeling with the same score. Note however           ber of incoming edges is at most d. Finally, the num-
that weights of some suboptimal paths do not corre-            ber of edges between skipping vertices is O(m). The
spond to the score of any labeling, as a labeling that         overall time of the algorithm with this graph will be
agrees with all vertices on a path may also agree with         O(md + ` + n); we still need to demonstrate that the
other hints that the path avoids.                              graph can be constructed within this running time.
42        Marcel Kucharı́k, Jakub Kováč, Broňa Brejová

    The critical ingredient is the ability to check in con-          with positive bonuses. Each hint (s, e, Y, b) with a neg-
stant time whether two hints are compatible. This can                ative bonus b is transformed to a set of hints of the
be achieved by building a suffix tree of the strings asso-           form (s, e0 , Y 0 , |b|) such that s ≤ e0 ≤ e and Y 0 is
ciated with individual hints and preprocessing the suf-              the same as Y on the first e0 − s positions and differs
fix tree for longest extension queries. These queries al-            from Y on its last position. We create all (|Σ| − 1)
low us to take two suffixes of any of the strings and de-            (e − s + 1) combinations of e0 and Y 0 of this form.
termine the length of their common prefix. In our sce-               Clearly, these hints are mutually incompatible, and
nario we take two hints (s1 , e1 , Y1 , b1 ), (s2 , e2 , Y2 , b2 )   therefore at most one of them will agree with any an-
and if they overlap by some length d > 0, we de-                     notation. An annotation A agrees with one of these
termine the positions of the start of the overlap in                 hints if and only if it does not agree with the original
strings Y1 and Y2 . If the suffixes starting at these two            hint h. This transformation increases the score of ev-
positions have a common prefix of length at least d,                 ery annotation by |b|: the annotations agreeing with h
the two hints are compatible. They are also compati-                 will increase by |b| because hint h is omitted from the
ble if d = 0. The preprocessing of necessary structures              set and all other annotations agree with one of the new
can be done in O(`) time, and compatibility of each                  hints, thus getting an additional bonus of |b|. When we
pair of hints can be then assessed in O(1) time [8].                 transform all hints with negative values in this way,
    To find the edges between hint vertices, we first                we increase the total score of every labeling by the
sort all hints by left endpoints. This can be done in                sum of absolute values of all negative bonuses. This
O(n + m) time by counting sort. Next, we traverse the                is a constant and therefore the optimal labeling will
sorted list and maintain a set of active hints. A hint               remain the same and can be found by the algorithm
is active if it overlaps the left endpoint of the current            described above. Unfortunately, the number of hints
hint in the list. In every step, the list contains at most           m as well as their total length ` increase by a factor
d hints, and therefore we can in each iteration traverse             of L|Σ| where L is the length of the longest negative
the list and remove hints that are no longer active.                 hint, but the running time is still polynomial.
The current hint h is a strict extension of any active
hint h0 such that h0 and h are compatible and h is not               Combination of complex hints with an HMM. In gene
a subset of h0 . This can be checked in O(1) time per                finding, we typically (although not exclusively [1])
pair of hints. If the current hint h is a subset of some             combine the score from hints with a probability value
active hint h0 , we will append h to a list of subsets               from some model capturing typical sequence features
of h0 .                                                              of genes and their constituents. Probability of a par-
    At the end of traversal, we have for each hint h the             ticular labeling A in hidden Markov models or condi-
list of its subset hints and the list of incoming edges              tional random fields can be written as
from other hint vertices. We sort both these lists by
                                                                                              n
right endpoints by counting sort in time proportional                                         Y
                                                                                    P (A) =         p(i, ai , ai+1 , X),
to the length of h (the right endpoints of all these
                                                                                              i=1
hints are within h). Now we use a merge-like algorithm
with two pointers to the two lists to find for every                 where factors p(i, ai , ai+1 , X) depend only on two ad-
incoming edge from hint h0 the bonus sum of subset                   jacent labels ai and ai+1 (and some portion of the
hints that end after the end of h0 . Edges incident to               input string X). They are computed from the emis-
skipping vertices can be easily constructed within the               sion and transition probabilities of the model. We will
desired running time. The overall running time of the                consider the case when the total score of A has the
graph construction is thus O(md + ` + n) where the                   form
n term comes from sorting and can be replaced by
                                                                                                     n
O(m log m).                                                                                          X
                                                                           B + log P (A) = B +             log p(i, ai , ai+1 , X),
                                                                                                     i=1
Complex hints with arbitrary bonuses. The algorithm
described above is efficient, but can only handle hints              where B is the sum of bonuses of all agreeing hints. The
with positive bonus values. If a hint h has a nega-                  log values in the sum can be written as hints of length 2
tive bonus value, the path could skip its vertex in the              of the form (i, i + 1, ai ai+1 , log p(i, ai , ai+1 , X)). Over-
graph, even if other visited vertices imply that the la-             all, we will need n|Σ|2 such hints and they typically all
beling agrees with h and its bonus should be counted.                have negative score. After applying the transformation
The graph may thus contain paths with weight higher                  described in the previous paragraphs and combining
that the score of the optimal labeling.                              together hints with the same values of s, e and Y, we
    As we will show, we can transform a set of hints                 get O(n|Σ|2 ) additional hints of length 1 and 2 with
with arbitrary bonuses to an equivalent set of hints                 positive bonuses.
                                                                 Gene finding with complex external information         43

    This generic procedure works only if the states of          corresponds to jth literal from ith clause, where pi,j =
the HMM correspond exactly to the symbols of the                N +3(i−1)+j. For each such literal we will add an as-
output alphabet Σ. Gene finding HMMs typically have             signment hint of the form (k, pi,j , z{0, 1}pi,j −k−1 1, 1)
a much larger state space, with several states cor-             where k is the index of the variable forming this literal
responding to the same output label. This situation             and z = 1 if the literal is yk and z = 0 if the literal
could be expressed using more general subset hints de-          is ¬yk . Notation {0, 1}c represents c copies of the set
scribed in Section 3. Although subset hints lead in gen-        {0, 1} ∈ Σ+ . In other words, the hint connects the lit-
eral to NP-hard problems, special cases necessary for           eral with its satisfying variable assignment, and if the
handling large state spaces can be solved efficiently, as       literal is selected by api,j = 1 and the variable yk has
discussed at the end of the next section. Moreover, it is       the satisfying assignment, the labeling gets a bonus 1.
also possible to create specialized algorithms for com-         We will also add three selector hints for every clause,
bining complex hints with a hidden Markov model [12].           that enforce that exactly one literal from the clause
By taking the special structure of the problem into ac-         is selected by api,j = 1. These hints have the form
count, we may obtain more efficient solutions than by           (pi,1 , pi,3 , Y, K) where Y is a binary string of length
converting the HMM to a collection of hints.                    three containing exactly two zeroes. Note that for each
                                                                clause at most one of the selector hints can agree with
                                                                a given labeling.
3    NP-hardness for subset hints                                    In order to achieve total score M + M K, the an-
                                                                notation has to agree with one selector hint for ev-
A subset hint is a four-tuple h = (s, e, Y, b) such that        ery clause. The loss of even one such selector could
numbers s, e, and b are defined as for a complex hint           not be compensated because the sum of bonuses of all
and Y is a string of length e − s + 1 over an extended          assignment hints is only 3M , which is less than K.
alphabet Σ+ containing all non-empty subsets of Σ.              Selector bonuses that contribute bonus M K enforce
Expression `(h, i) now denotes the set of labels sug-           that exactly one literal in each clause is selected. The
gested by h for xi and a labeling A agrees with h if for        selected literal has a potential to contribute bonus 1
every position i ∈ [s, e] the label ai is in the set `(h, i).   if it is indeed satisfied in the assignment specified by
Unfortunately, our algorithm for complex hints can-             the labeling of the first N symbols. If the formula has
not be straightforwardly extended to this scenario, be-         score at least M + M K, every clause has exactly one
cause the relation of extension is not transitive, which        literal which is simultaneously selected and satisfied.
is crucial for the existence of a labeling satisfying all       Note that multiple literals per clause may be satisfied,
hints on a path in our graph.                                   but if they are not selected, they will not influence
    As we will prove, it is unlikely that the problem           the score. Annotation with score M + M K thus corre-
can be solved efficiently in the full generality by other       sponds to a satisfying assignment of the formula and
methods, since it is NP-hard.                                   vice versa.                                              t
                                                                                                                         u
Theorem 2. The problem of testing whether there is              This proof is a slight modification of our earlier
a labeling A with score at least t for a set of subset proof for the use of RT-PCR queries in gene find-
hints is NP-complete, even for a binary alphabet Σ.         ing [10]. In this earlier work we have considered hints
                                                            of a special form that suggest a specific label at both
Proof. Clearly the problem is in NP (if the size of in- ends and allow arbitrary labels in the middle. How-
put is much smaller than n, we can specify only por- ever the output labeling was constrained to obey an
tions of A covered by hints and this is sufficient for additional DAG of possible gene structures. Here we
efficient computation of the score).                        have adapted the proof for the simpler case where the
     To prove NP-hardness, we will use a reduction from annotation can be arbitrary.
the 3-SAT problem. Consider a 3-SAT instance with
N variables y1 , . . . , yN and M clauses, each clause con- Partition subset hints. Subset hints often occur in
sisting of three literals. We will encode it as a set of practice because our methods for obtaining additional
hints over the alphabet Σ = {0, 1} such that the for- information about the labeling cannot distinguish be-
mula has a satisfying assignment if and only if there tween some labels from Σ. In some situations we can
is a labeling with score at least M + M K where K = represent subset hints as complex hints over an alpha-
3M + 1.                                                     bet Σ 0 ⊆ Σ+ that forms a partition of Σ into equiva-
     The length of the annotated sequence will be n = lence classes (in other words, every two elements in Σ 0
N + 3M . The first N labels correspond to a truth as- are disjoint subsets of Σ). We will now consider a situ-
signment to all variables, that is, ai = 1 corresponds ation where every hint contains only characters from Σ
to yi being true. Each of the remaining labels cor- or only characters from Σ 0 (but cannot use charac-
responds to one literal from the formula, namely api,j ters from both). We will show that optimal labeling
44       Marcel Kucharı́k, Jakub Kováč, Broňa Brejová

(over Σ) for such hint sets can be found in polynomial           is relatively well sequenced and annotated. Genomic
time.                                                            sequences and reference annotations were downloaded
    Briefly, we first create a directed acyclic graph G          from the UCSC genome browser [7]. We have used
for hints over Σ similarly as in Section 2. We add new           44 MB of DNA sequence containing 5 164 genes. They
hints of the form (i, i, a, 0) for all positions i and labels    were split to 2 MB parts and divided to train-
a ∈ Σ. Both the original hints and these new hints               ing (≈ 28 MB, 3368 genes) and testing (≈ 16 MB,
will be vertices. We will have an edge from hint h1              1796 genes) data sets.
to hint h2 if h2 is an extension of h1 and they ei-                  The HMM has been taken from the work of [15]
ther overlap or h2 starts immediately after h1 ends.             and retrained on our training data set by standard
In this graph, every path consists of hints that com-            procedures [6]. It has 265 states, but it is quite simple
pletely cover the sequence. In the same way we also              compared to state-of-the-art gene finders. For exam-
build a graph G0 for hints over Σ 0 . Finally, we create         ple, it does not allow arbitrary length distributions of
graph G00 in which each vertex is a pair (h, h0 ) such           exons, introns and intergenic regions and uses simpler
that h is a vertex in G, h0 is a vertex in G0 , hints h          models of sequence motifs at exon boundaries. In our
and h0 overlap and are compatible. If hint h0 ends be-           experiments we compare the accuracy achieved on the
fore h, this vertex will be connected to vertices of the         testing set by the HMM alone and with different sets
form (h, h00 ) where (h0 , h00 ) is an edge in G0 , the weight   of hints.
of the edge is also copied from G0 . Similarly, if h ends
before h0 , (h, h0 ) will be connected to vertices of the
form (h00 , h0 ) where (h, h00 ) is an edge in G. Finally,                                                     Protein hints
if the two hints end together, they will be connected                 Gene 1                Gene 2
to vertices of the form (h00 , h000 ) where (h, h00 ) is an                                                    Reference
edge in G and (h0 , h000 ) is an edge in G0 . The weight of
                                                                                                               Interval hints
the new edge will be the sum of the two source edges.
The best path in G00 then corresponds to the optimal                                                           Complex hints
annotation (proof omitted due to space).
                                                                                                               No hints
    The algorithm can be extended to several groups
of hints, each over a different partition of Σ, but the          1226000       1227000       1228000       1229000
running time would increase exponentially in the num-
ber k of such groups, because the vertices in the final          Fig. 3. Example of a prediction obtained with different
graph will be k-tuples of vertices of graphs for individ-        protein hint sets in a region containing two genes. Lines de-
                                                                 pict introns, rectangles exons (gene 1 has one exon, gene 2
ual groups of hints.
                                                                 has four exons). Four identical complex hints span the
                                                                 whole length of gene 1 and suggest short intergenic re-
4    Use of complex hints in gene finding                        gion beyond its boundary. Prediction with complex hints is
                                                                 correct. Prediction with interval hints does not agree with
We have implemented a variant of the algorithm from              the intergenic hint at the right end of the gene. Prediction
Section 2 and applied it to the problem of gene finding.         without hints predicts a long gene extending beyond the
                                                                 displayed region.
Recall that the goal of gene finding is to find genes,
regions encoding proteins in DNA sequences. A gene
in eukaryotic organisms can be divided into several              Experiment with protein hints. In the first experiment,
segments called exons and introns, and only exons en-            we have used a real set of hints originating from
code the protein (see Figure 3). Thus the goal of gene           a database of known proteins. We have downloaded all
finding is to find the position of each gene and its ex-         known proteins from several insect species (D. melano-
act partitioning to exons and introns. We use output             gaster, D. simulans, D. pseudoobscura, Anopheles
alphabet Σ = {C, I, X}, where C stands for coding                gambiae, and Bombyx mori ) from the NCBI RefSeq
exons, I for introns, and X for intergenic regions.              database [13] (67, 893 protein sequences in total). We
    The algorithm was implemented in C++ as a stan-              have used the BLAT program [9] and scripts from Ex-
dalone console application called grapHMM. In order              onHunter distribution [2, 11] to find regions in the test-
to process long DNA sequences, we implemented sev-               ing set that could encode identical or similar proteins
eral practical improvements. For example, the graph              to those in the database. Weak matches were were
is generated on the fly and unnecessary information is           filtered out to increase the specificity. As a result of
immediately deleted to free up memory.                           this strict filtering, only 11% of the sequence is cov-
    We have tested the program on genomic data from              ered by hints, whereas genes cover almost 45% of the
Drosophila melanogaster (fruit fly). This species is an          sequence. Hints contain exon and intron labels (ex-
important model organism in genetics and its genome              ons are regions where the protein seems to be encoded
                                                             Gene finding with complex external information               45

and introns are separating parts encoding the same              Hint set   No hints   Point-wise     Interval   Complex
protein). If the whole source protein could be aligned          Bonus         –           8             8           6
to the genome, a single intergenic position is added            Gene sn.   48.83%      61.08%        61.30%      61.86%
to the beginning and to the end of the complete gene            Gene sp.   40.40%      46.00%        46.42%      46.61%
                                                                Exon sn.   71.18%      75.89%        76.20%      76.14%
hint.
                                                                Exon sp.   69.01%      71.73%        72.31%      72.29%
    Our goal was to compare prediction accuracy ob-             Base sn.   92.81%      94.35%        94.15%      94.14%
tained using complex hints with accuracy obtained us-           Base sp.   91.86%      92.00%        92.10%      92.08%
ing simpler interval and point-wise hints. Therefore we
have converted our hint set to these simpler forms by       Tab. 1. Comparison of hint sets created from protein se-
                                                            quences. For each set, we show the trained value of the
splitting each complex hints to several shorter hints
                                                            bonus parameter as well as several standard measures of
as necessary. Overall we have thus used three different
                                                            gene finding accuracy. Gene sensitivity (sn.) is a fraction of
hint sets:                                                  real genes that were predicted completely correctly. Gene
 – complex hint set (15,057 training hints, 5,545 test-     specificity (sp.) is the fraction of predicted genes that are
   ing hints)                                               completely correct. Similarly, we measure sensitivity and
 – interval hint set (39,900 training hints, 14,358 test-   specificity at the exon and base level, where we count com-
   ing hints)                                               pletely correctly predicted exons and individual symbols in
 – point-wise hint set (8,651,653 training hints,           coding exons.
   4,042,685 testing hints)
    Bonuses of hints have great influence on the gene       a random place in the sequence and suggest label for
prediction accuracy: if they are too low, the hints will    intergenic regions along their whole length. Some of
be ignored; if they are too high, wrong hints will de-      these hints are correct, since by chance they are con-
crease the gene prediction accuracy. For simplicity, all    tained in real intergenic regions, whereas others over-
hints in one hint set have the same bonus b, and we         lap real genes. Note that we have chosen intergenic
choose optimal value of b to maximize the accuracy on       hints, because it is non-trivial to randomly generate
the training set for each set of hints separately. The      other reasonable wrong hints that have a non-zero
exception are complex hints, where the bonus of a hint      probability in the HMM. Overall, 74% of hints in our
is bk where k is the number of contiguous regions in the    set were correct. In this experiment, we have compared
hint labeled by one label (or in other words the number     complex and interval hints, again training their bonus
of interval hints comprising this complex hint). Again      on the training set and testing their performance on
the value of b was optimized using the training set.        the testing set. Results shown in Table 2 are analogous
    Evaluation of these three hints sets on the test se-    to the protein hint experiment – slight improvement in
quences is shown in Table 1. Complex hints slightly         the prediction accuracy at the gene level for complex
increase the number of completely correctly predicted       hints compared with interval hints.
genes compared to the interval hints, but the differ-
ence is less than 1%. On the other hand, difference in                 Hint set           Interval    Complex
accuracy between prediction without hints and with                     Bonus                 6           12
one of the hint sets is quite significant.                             Gene sensitivity   60.75%       61.53%
    Figure 3 shows an example of a gene where com-                     Gene specificity   49.43%       50.41%
plex hints lead to a better prediction than interval                   Exon sensitivity   77.31%       77.92%
hints. The algorithm with interval hints chooses an                    Exon specificity   76.51%       76.92%
annotation agreeing with only two out of three parts                   Base sensitivity   94.98%       94.69%
comprising a single complex hint. With complex hints                   Base specificity   93.52%       93.39%
only annotations agreeing with the whole hint get the       Tab. 2. Comparison of gene prediction accuracy with ar-
bonus, and one such annotation is indeed the opti-          tificial hints sets, using the same accuracy measures as in
mum.                                                        Table 1.

Experiment with simulated hints. To better control
various parameters of the hint set, we have also gen-
erated artificial hints. All hints in this set have the     5     Conclusion and open problems
length 1000, and the number of hints was chosen so
that the sum of their lengths is approximately n/2.         In this paper, we have explored several variants of the
Half of the hints were created so that they start at        problem of finding optimal sequence annotation that
a random place within some exon and agree with the          agrees with hints with the highest total score. If the
reference annotation along their whole length. The sec-     hints completely specify annotation within some in-
ond half of hints was generated so that they start at       terval and all hints have positive scores, the problem
46       Marcel Kucharı́k, Jakub Kováč, Broňa Brejová

can be solved quite efficiently, in time O(md + ` + n)           6. R. Durbin, R. Eddy, A. Krogh, and G. Mitchison: Bio-
where n is the length of the annotated sequence, ` is               logical sequence analysis. Cambridge University Press,
the total length of all hints, m is the number of hints             1998.
and d is the maximum number of hints overlapping                 7. P.A. Fujita, B. Rhead, A.S. Zweig, et al. (2011). The
a single position. If some hints have negative scores,              UCSC Genome browser database: update 2011. Nucleic
                                                                    Acids Research, 39(Database issue), 2011, D876–882.
the problem can still be solved in polynomial time. Fi-
                                                                 8. D. Gusfield: Algorithms on strings, trees and se-
nally, we can also solve the case where hints are over              quences: computer science and computational biology.
two different alphabets, one being a partition of the               Cambridge University Press, 1997.
other. These algorithms can be also extended to com-             9. W.J. Kent: BLAT – the BLAST-like alignment tool.
bine hints with hidden Markov models or conditional                 Genome Research, 12(4), 2002, 656–664.
random fields. We have also shown that if we allow              10. J. Kováč, T. Vinař, and B. Brejová: Predicting gene
wildcards in hints, the problem becomes NP-hard even                structures from multiple RT-PCR tests. In Algorithms
for binary output alphabet.                                         in Bioinformatics, 9th International Workshop, WABI
    Our results might be applicable in various fields               2009, volume 5724 of LNCS, pp. 181–193. Springer,
where HMMs and their variants need to be combined                   2009.
                                                                11. P. Kováč: Implementácia externých zdrojov dát v
with external information. However, our original mo-
                                                                    hl’adanı́ génov. Bachelor Thesis, Department of Com-
tivation stems from gene finding, where complex and
                                                                    puter Science, Comenius University in Bratislava,
subset hints are the most natural form of express-                  2010.
ing hints from various sources. Our experiments with            12. M. Kucharı́k: A new algorithm for using external in-
a simple gene finder show that compared to simpler                  formation in gene finding. Master’s Thesis, Depart-
interval hints used before, complex hints may lead to               ment of Computer Science, Comenius University in
slight increases in prediction accuracy. More experi-               Bratislava. Submitted 2011.
ments with different information sources or in different        13. K.D. Pruitt, T. Tatusova, and D.R. Maglott: NCBI
species may lead to more significant improvements.                  reference sequences (RefSeq): a curated non-redundant
    Several open problems remain in this area. Our al-              sequence database of genomes, transcripts and pro-
                                                                    teins. Nucleic Acids Research, 35(Database issue),
gorithms are in the worst case quadratic in the number
                                                                    2007, D61–65.
of hints. The question is whether more efficient algo-
                                                                14. M. Stanke, M. Diekhans, R. Baertsch, and D. Haussler:
rithms exist at least for the case of interval hints. It            Using native and syntenically mapped cDNA align-
might be also useful to consider special classes of sub-            ments to improve de novo gene finding. Bioinformat-
set hints that can be processed in polynomial time. For             ics, 24(5), 2008, 637–644.
example in our earlier work the RT-PCR hints were               15. R. Šrámek, B. Brejová, and T. Vinař: On-line Viterbi
NP-hard in general but solvable efficiently if their rel-           algorithm for analysis of long biological sequences. In
ative position was constrained [10]. Finally, one could             Algorithms in Bioinformatics: 7th International Work-
also generalize complex hints in other ways, for ex-                shop (WABI), volume 4645 of Lecture Notes in Com-
ample, allowing some uncertainty in the exact place                 puter Science, pp. 240–251. Springer.
where labeling changes from one label to another.               16. R.F. Yeh, L.P. Lim, and C.B. Burge: Computational
                                                                    inference of homologous gene structures in the human
                                                                    genome. Genome Research, 11(5), 2001, 803–806.
References

 1. J.E. Allen, and S.L. Salzberg: . JIGSAW: integra-
    tion of multiple sources of evidence for gene prediction.
    Bioinformatics, 21(18), 2005, :3596–3603.
 2. B. Brejová, D.G. Brown, M. Li, and T. Vinař: Ex-
    onHunter: a comprehensive approach to gene finding.
    Bioinformatics, 21 Suppl 1:, 2005, 57–65.
 3. C. Burge, and S. Karlin: Prediction of complete gene
    structures in human genomic DNA. Journal of Molec-
    ular Biology, 268(1), 1997, 78–94.
 4. T.H. Cormen, C.E. Leiserson, and R.L. Rivest: Intro-
    duction to algorithms, second edition. The MIT Press
    and McGraw-Hill Book Company, 1989.
 5. D. DeCaprio, J.P. Vinson, M.D. Pearson, P. Mont-
    gomery, M. Doherty, and J.E. Galagan: Conrad: gene
    prediction using conditional random fields. Genome
    Research, 17(9), 2007, 1389–1398.