=Paper=
{{Paper
|id=None
|storemode=property
|title=Gene finding with complex external information
|pdfUrl=https://ceur-ws.org/Vol-788/paper6.pdf
|volume=Vol-788
|dblpUrl=https://dblp.org/rec/conf/itat/KucharikKB11
}}
==Gene finding with complex external information==
Marcel Kucharík, Jakub Kováč, and Broňa Brejová
Department of Computer Science, Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská Dolina, 842 48 Bratislava, Slovakia

Abstract. The goal of gene finding is to locate genes, which are important segments of DNA encoding proteins. Programs solving this task are based on hidden Markov models (HMMs) capturing statistical features extracted from known genes, but often also incorporate hints about the correct gene structure extracted from experimental data. Existing gene finding programs can use such external information only in a limited way. Typically, they can process only simple hints describing a single part of the gene structure, because these are relatively easy to incorporate into standard HMM algorithms, but cannot cope with complex hints spanning multiple parts. We have developed an efficient algorithm able to process such complex hints. Our experiments show that this approach slightly increases the accuracy of gene prediction. We also prove that a more general class of hints leads to an NP-hard problem.

This research is funded by European Community FP7 grant IRG-231025, VEGA grant 1/0210/10, and APVV grant SK-CN-0007-09.

===1 Introduction===

In this paper we study a combinatorial optimization problem arising in computational biology. Although we were originally motivated by a very specific application, we formulate the problem quite generally and explore its variants that admit efficient polynomial-time algorithms as well as those that are NP-hard.

Our motivation comes from the problem of gene finding. Here the goal is to locate segments of a DNA sequence that encode proteins produced by the organism. On a more abstract level, we are given a string X = x1 ... xn and the goal is to produce a string A = a1 ... an over some output alphabet Σ of possible labels, such that ai is a label of xi corresponding to its functional role. In gene finding, X is the input DNA sequence (over the alphabet {A, C, G, T}) and the output alphabet could be Σ = {0, 1}, where 1 stands for genes and 0 for non-genic regions. We will call A a labeling or annotation of the input string X.

The desired mapping from X to A is in gene finding often characterized by a probabilistic model defining a probability distribution P(A|X). We then output the labeling with the highest probability. Frequently used models are hidden Markov models (HMMs) or their variants, such as hidden semi-Markov models and conditional random fields [3, 5]. These models take into account statistical properties of genic and non-genic regions of DNA, such as frequency of k-tuples, length distribution of genes, and characteristic sequence motifs at gene boundaries. Genes found by HMMs are often imprecise, and to increase the accuracy, statistical models of DNA are often combined with additional external information provided by biological experiments. External information is often in the form of individual hints, each hint giving a probable location of one gene or a part of a gene.

To combine an HMM with hints, we can manipulate probabilities of individual labelings A so that a labeling gets a bonus for each hint agreeing with it. Different gene finders define the set of possible hints and their influence on the probability space differently, but the possibilities are restricted by the need for an efficient algorithm finding the annotation with the highest probability in the modified space. The simplest form of a hint is a point-wise hint, which increases or decreases probabilities of all annotations that have a certain label at a certain position in the sequence [2, 16]. Such hints can be easily incorporated into the standard Viterbi algorithm for finding the most probable annotations in HMMs.

However, a single piece of evidence often suggests labels for a longer interval of the sequence. Splitting such information into multiple point-wise hints leads to information loss, because some labelings can get the bonus even if they do not agree with the evidence over the whole interval. Therefore, some gene finders [14, 5] use hints in the form of intervals such that the probability of a labeling is increased only if it has the label prescribed by the hint throughout the whole extent of the interval. We call such hints interval hints. The main focus of our paper is to study further generalizations of interval hints.

For most of the time we will abstract away both the HMM and the input string X and consider only a set of hints. Indeed, probabilities from the HMM can be expressed as special hints of length 2, as we explain in Section 2. Therefore, in the most general setting, we can formulate our problem as follows:

Definition 1 (Optimal labeling with hints problem). Given is a positive integer n, a finite alphabet Σ, and a set of hints. Each hint is a pair (γ, b) where γ ⊆ Σ^n is a set of labelings and b ∈ R is a bonus. We say that a labeling A ∈ Σ^n agrees with hint (γ, b) if A ∈ γ. The score of a labeling A ∈ Σ^n is the sum of bonuses of all hints that agree with A. The goal is to find a labeling with the maximum score.

In our work the hints will assume a special form allowing compact representation of a potentially large set γ. In particular, a complex hint is a four-tuple (s, e, Y, b) such that 1 ≤ s ≤ e ≤ n and Y ∈ Σ^(e−s+1) is the labeling suggested for xs ... xe (see Figure 1). The set γ for this hint is γ = Σ^(s−1) Y Σ^(n−e). Interval hints are a special case of complex hints with Y = a^k for some a ∈ Σ, and point-wise hints are a special case of interval hints with s = e.

 h1 = (1, 4, 0001, 2)   0001...
 h2 = (2, 3, 00, 1)     .00....
 h3 = (2, 6, 00100, 1)  .00100.
 h4 = (3, 6, 0001, 1)   ..0001.
 h5 = (6, 7, 02, 1)     .....02
 A                      0001002

Fig. 1. A set of complex hints for n = 7 and Σ = {0, 1, 2} and an optimal labeling with score 5, agreeing with hints h1, h2, h3 and h5. Note that h2 is an interval hint.

In this work we describe an efficient algorithm for finding the optimal labeling for a set of complex hints, which is a non-trivial extension of algorithms for interval hints (Section 2). We also show that an additional natural generalization of the problem, where some positions between s and e are allowed to vary, leads to an NP-hard problem (Section 3). Finally, in Section 4, we describe the use of our algorithm in the context of gene finding and show experimental results on real and simulated data.

===2 An algorithm for complex hints===

In this section we give a polynomial-time algorithm which finds the optimal labeling for a set of complex hints. We will first discuss several simpler cases, later generalizing the algorithm to the full problem.

Interval hints. Gene finders employing interval hints typically proceed by dynamic programming. For each prefix x1 ... xi of the input string X, they consider all possibilities for the last region of this prefix labeled by one label (that is, all possible j ≤ i and a ∈ Σ such that aj = aj+1 = ... = ai = a and aj−1 ≠ a). For each such interval (j, i), it is easy to compute the sum of bonuses of all interval hints agreeing with label a. This type of algorithm is dictated by the fact that gene finders combine hints with hidden semi-Markov models or their variants [14, 5], and the Viterbi algorithm for inference in these models already has this form, with running time O(n²|Σ|²). With appropriate data structures, it is possible to implement the algorithm so that processing of hints adds only an additive term O(m), where m is the number of hints. However, a more efficient algorithm would be desirable if we wanted to use hints with regular HMMs, where the Viterbi algorithm is linear in n. The running time of our algorithm for complex hints with positive bonuses will not depend on the sequence length, only on the number and total length of hints.

Complex hints with positive bonuses. In order to introduce our algorithm for a set of complex hints with positive bonuses, we start with several necessary definitions. Let h = (s, e, y1 ... y(e−s+1), b) be a complex hint. Then for i ∈ [s, e] the expression ℓ(h, i) denotes the label suggested by h for xi, that is, ℓ(h, i) = y(i−s+1). Now let us consider two hints h1 = (s1, e1, Y1, b1) and h2 = (s2, e2, Y2, b2). We say that h1 and h2 are compatible if they agree in the intersection of their intervals, that is, if for every i ∈ [s1, e1] ∩ [s2, e2] we have ℓ(h1, i) = ℓ(h2, i) (non-intersecting hints are always compatible). Hint h1 is a subset of hint h2 if they are compatible and [s1, e1] ⊆ [s2, e2]. Hint h2 is an extension of h1 if they are compatible and e2 > e1 and s2 > s1. Note that if s1 = s2, e1 = e2, and Y1 = Y2, these two hints can be replaced by a single hint with bonus b1 + b2. We will assume that such a transformation is done on all applicable pairs.

In our algorithm, we transform the input to create a weighted directed acyclic graph (DAG) such that the optimal labeling corresponds to the path with maximum weight between two given vertices s and t. Such longest paths can be found in DAGs efficiently, in O(|V| + |E|) time [4]. Our algorithm could also be expressed directly as dynamic programming, but the graph representation is more convenient for proving correctness and considering variants of the algorithm.

Our graph has one vertex for each hint and two special vertices s and t. There is an edge from h1 to h2 if and only if h2 is an extension of h1. The weight of this edge is the sum of the bonus of h2 and the bonuses of all hints that are a subset of h2 but are not a subset of h1 (see Figure 2). Vertices s and t are connected with other vertices as if they corresponded to sentinel intervals beyond the end of the sequence, namely (0, 0, a, 0) and (n + 1, n + 1, a, 0). Clearly, the graph created in this way is acyclic, because vertices on each path have increasing coordinates of their endpoints. Correctness of the algorithm is given by the following theorem.

Theorem 1. The weight W* of the maximum-weight path from s to t in the graph described above is equal to the score S* of an optimal labeling for the input set of hints.

Fig. 2. Directed acyclic graph created by our algorithm for the hint set from Figure 1. Vertices for hints are shown as grey boxes.

Proof. First, we will prove that for every labeling A with score S there is a path with weight S, and therefore W* ≥ S*. Let H be the set of hints that agree with A. Let HS be the set of all hints from H that are a subset of another hint from H, and let HR = H \ HS. Let h1 and h2 be two different hints from HR such that s1 < s2. Then also e1 < e2, because no hint in HR is a subset of another, and these two hints are compatible because they both agree with A. Therefore, h2 is an extension of h1. Our path will start in s, pass through all vertices in the set HR ordered from left to right by their left endpoints, and end in t. As we have shown, all necessary edges on the path exist. The bonus of a hint h ∈ HR is included in the weight of the edge entering h, and the bonus of a hint h ∈ HS is included in the weight of the edge entering the leftmost hint h' ∈ HR such that h is a subset of h'. Bonuses of no other hints are included in the weight of the path, and therefore its weight is exactly S.

Now we will prove that for any path π from s to t with weight W, there is a labeling with score at least W, implying that W* ≤ S*. Let H0 be the set of all hints corresponding to vertices of π except s and t. We will construct a labeling A = a1 ... an as follows. If a position i is not covered by any hint in H0, its label ai can be chosen arbitrarily. If i is covered by some hints h1, ..., hk from H0, they suggest the same label for position i, and this label is chosen as ai. Now let H00 be the set containing all hints from H0 and all hints that are subsets of hints from H0. The weight W of path π is the sum of bonuses of hints from H00. Also, each hint from this set agrees with A. The score of A is the sum of bonuses of all hints that agree with it, and is therefore at least W. □

We have proved that the weight of the optimal path is equal to the score of the optimal labeling, and the proof also implies an algorithm to convert the optimal path to a labeling with the same score. Note, however, that weights of some suboptimal paths do not correspond to the score of any labeling, as a labeling that agrees with all vertices on a path may also agree with other hints that the path avoids.

Our graph has O(m) vertices and O(m²) edges, where m is the number of hints. As we will show below, it can be constructed in O(m² + ℓ) time, where ℓ is the sum of lengths of all hints. The running time of the whole algorithm is therefore O(m² + ℓ + n), where the dependence on n comes only from the necessity to produce a labeling of length n (using arbitrary labels for parts of the sequence not covered by any hint).

Running time improvement for practical instances. In practical gene finding instances, we expect hints to be much shorter than n and to be spread along the length of the sequence. We can modify the graph so that it has fewer edges if every position is covered by at most d hints for some d < m. The construction shown above will create edges even between hints that are far apart, and those can be eliminated. The weight of an edge from hint h' to hint h depends on the mutual position of these hints, because we include only those subset hints of h that end after the end of h'. However, the weight will be the same for all hints h' that do not intersect h. Our goal is to remove such edges from the graph and replace them with a smaller number of edges, linear in m.

We will say that a hint h2 is a strict extension of a hint h1 if h2 is an extension of h1 and the two hints overlap. Our new graph will have a vertex for each hint, two special vertices s and t for sentinel hints, and a skipping vertex vi for every position i which is a right endpoint of at least one hint (including position 0, where the sentinel hint for s ends). Skipping vertices are connected into a chain going from left to right with edges of weight 0. The vertex for a hint h is connected to all hints that are its strict extensions, with the same edge weights as before. It is also connected to the skipping vertex for its right endpoint with an edge of weight 0. Finally, each skipping vertex vi is connected to all hints that start in the interval (i, j], where vj is its neighbor in the chain of skipping vertices. The edge from vi to h will contain the bonus of h and all its subset bonuses. Paths in this new graph have a 1-1 correspondence with the paths in the original graph, where an edge between two non-overlapping hints is now replaced by a path through skipping vertices. The number of vertices is O(m) and the number of edges O(md), since from a vertex for hint h there are at most d outgoing edges: one to a skipping vertex and at most d − 1 to other hints, because all these hints cover the right endpoint of h. Similarly, the number of incoming edges is at most d. Finally, the number of edges between skipping vertices is O(m). The overall time of the algorithm with this graph will be O(md + ℓ + n); we still need to demonstrate that the graph can be constructed within this running time.
42 Marcel Kucharı́k, Jakub Kováč, Broňa Brejová The critical ingredient is the ability to check in con- with positive bonuses. Each hint (s, e, Y, b) with a neg- stant time whether two hints are compatible. This can ative bonus b is transformed to a set of hints of the be achieved by building a suffix tree of the strings asso- form (s, e0 , Y 0 , |b|) such that s ≤ e0 ≤ e and Y 0 is ciated with individual hints and preprocessing the suf- the same as Y on the first e0 − s positions and differs fix tree for longest extension queries. These queries al- from Y on its last position. We create all (|Σ| − 1) low us to take two suffixes of any of the strings and de- (e − s + 1) combinations of e0 and Y 0 of this form. termine the length of their common prefix. In our sce- Clearly, these hints are mutually incompatible, and nario we take two hints (s1 , e1 , Y1 , b1 ), (s2 , e2 , Y2 , b2 ) therefore at most one of them will agree with any an- and if they overlap by some length d > 0, we de- notation. An annotation A agrees with one of these termine the positions of the start of the overlap in hints if and only if it does not agree with the original strings Y1 and Y2 . If the suffixes starting at these two hint h. This transformation increases the score of ev- positions have a common prefix of length at least d, ery annotation by |b|: the annotations agreeing with h the two hints are compatible. They are also compati- will increase by |b| because hint h is omitted from the ble if d = 0. The preprocessing of necessary structures set and all other annotations agree with one of the new can be done in O(`) time, and compatibility of each hints, thus getting an additional bonus of |b|. When we pair of hints can be then assessed in O(1) time [8]. transform all hints with negative values in this way, To find the edges between hint vertices, we first we increase the total score of every labeling by the sort all hints by left endpoints. 
This can be done in sum of absolute values of all negative bonuses. This O(n + m) time by counting sort. Next, we traverse the is a constant and therefore the optimal labeling will sorted list and maintain a set of active hints. A hint remain the same and can be found by the algorithm is active if it overlaps the left endpoint of the current described above. Unfortunately, the number of hints hint in the list. In every step, the list contains at most m as well as their total length ` increase by a factor d hints, and therefore we can in each iteration traverse of L|Σ| where L is the length of the longest negative the list and remove hints that are no longer active. hint, but the running time is still polynomial. The current hint h is a strict extension of any active hint h0 such that h0 and h are compatible and h is not Combination of complex hints with an HMM. In gene a subset of h0 . This can be checked in O(1) time per finding, we typically (although not exclusively [1]) pair of hints. If the current hint h is a subset of some combine the score from hints with a probability value active hint h0 , we will append h to a list of subsets from some model capturing typical sequence features of h0 . of genes and their constituents. Probability of a par- At the end of traversal, we have for each hint h the ticular labeling A in hidden Markov models or condi- list of its subset hints and the list of incoming edges tional random fields can be written as from other hint vertices. We sort both these lists by n right endpoints by counting sort in time proportional Y P (A) = p(i, ai , ai+1 , X), to the length of h (the right endpoints of all these i=1 hints are within h). Now we use a merge-like algorithm with two pointers to the two lists to find for every where factors p(i, ai , ai+1 , X) depend only on two ad- incoming edge from hint h0 the bonus sum of subset jacent labels ai and ai+1 (and some portion of the hints that end after the end of h0 . 
Edges incident to input string X). They are computed from the emis- skipping vertices can be easily constructed within the sion and transition probabilities of the model. We will desired running time. The overall running time of the consider the case when the total score of A has the graph construction is thus O(md + ` + n) where the form n term comes from sorting and can be replaced by n O(m log m). X B + log P (A) = B + log p(i, ai , ai+1 , X), i=1 Complex hints with arbitrary bonuses. The algorithm described above is efficient, but can only handle hints where B is the sum of bonuses of all agreeing hints. The with positive bonus values. If a hint h has a nega- log values in the sum can be written as hints of length 2 tive bonus value, the path could skip its vertex in the of the form (i, i + 1, ai ai+1 , log p(i, ai , ai+1 , X)). Over- graph, even if other visited vertices imply that the la- all, we will need n|Σ|2 such hints and they typically all beling agrees with h and its bonus should be counted. have negative score. After applying the transformation The graph may thus contain paths with weight higher described in the previous paragraphs and combining that the score of the optimal labeling. together hints with the same values of s, e and Y, we As we will show, we can transform a set of hints get O(n|Σ|2 ) additional hints of length 1 and 2 with with arbitrary bonuses to an equivalent set of hints positive bonuses. Gene finding with complex external information 43 This generic procedure works only if the states of corresponds to jth literal from ith clause, where pi,j = the HMM correspond exactly to the symbols of the N +3(i−1)+j. For each such literal we will add an as- output alphabet Σ. Gene finding HMMs typically have signment hint of the form (k, pi,j , z{0, 1}pi,j −k−1 1, 1) a much larger state space, with several states cor- where k is the index of the variable forming this literal responding to the same output label. 
This situation and z = 1 if the literal is yk and z = 0 if the literal could be expressed using more general subset hints de- is ¬yk . Notation {0, 1}c represents c copies of the set scribed in Section 3. Although subset hints lead in gen- {0, 1} ∈ Σ+ . In other words, the hint connects the lit- eral to NP-hard problems, special cases necessary for eral with its satisfying variable assignment, and if the handling large state spaces can be solved efficiently, as literal is selected by api,j = 1 and the variable yk has discussed at the end of the next section. Moreover, it is the satisfying assignment, the labeling gets a bonus 1. also possible to create specialized algorithms for com- We will also add three selector hints for every clause, bining complex hints with a hidden Markov model [12]. that enforce that exactly one literal from the clause By taking the special structure of the problem into ac- is selected by api,j = 1. These hints have the form count, we may obtain more efficient solutions than by (pi,1 , pi,3 , Y, K) where Y is a binary string of length converting the HMM to a collection of hints. three containing exactly two zeroes. Note that for each clause at most one of the selector hints can agree with a given labeling. 3 NP-hardness for subset hints In order to achieve total score M + M K, the an- notation has to agree with one selector hint for ev- A subset hint is a four-tuple h = (s, e, Y, b) such that ery clause. The loss of even one such selector could numbers s, e, and b are defined as for a complex hint not be compensated because the sum of bonuses of all and Y is a string of length e − s + 1 over an extended assignment hints is only 3M , which is less than K. alphabet Σ+ containing all non-empty subsets of Σ. Selector bonuses that contribute bonus M K enforce Expression `(h, i) now denotes the set of labels sug- that exactly one literal in each clause is selected. 
The gested by h for xi and a labeling A agrees with h if for selected literal has a potential to contribute bonus 1 every position i ∈ [s, e] the label ai is in the set `(h, i). if it is indeed satisfied in the assignment specified by Unfortunately, our algorithm for complex hints can- the labeling of the first N symbols. If the formula has not be straightforwardly extended to this scenario, be- score at least M + M K, every clause has exactly one cause the relation of extension is not transitive, which literal which is simultaneously selected and satisfied. is crucial for the existence of a labeling satisfying all Note that multiple literals per clause may be satisfied, hints on a path in our graph. but if they are not selected, they will not influence As we will prove, it is unlikely that the problem the score. Annotation with score M + M K thus corre- can be solved efficiently in the full generality by other sponds to a satisfying assignment of the formula and methods, since it is NP-hard. vice versa. t u Theorem 2. The problem of testing whether there is This proof is a slight modification of our earlier a labeling A with score at least t for a set of subset proof for the use of RT-PCR queries in gene find- hints is NP-complete, even for a binary alphabet Σ. ing [10]. In this earlier work we have considered hints of a special form that suggest a specific label at both Proof. Clearly the problem is in NP (if the size of in- ends and allow arbitrary labels in the middle. How- put is much smaller than n, we can specify only por- ever the output labeling was constrained to obey an tions of A covered by hints and this is sufficient for additional DAG of possible gene structures. Here we efficient computation of the score). have adapted the proof for the simpler case where the To prove NP-hardness, we will use a reduction from annotation can be arbitrary. the 3-SAT problem. Consider a 3-SAT instance with N variables y1 , . . . 
, yN and M clauses, each clause con- Partition subset hints. Subset hints often occur in sisting of three literals. We will encode it as a set of practice because our methods for obtaining additional hints over the alphabet Σ = {0, 1} such that the for- information about the labeling cannot distinguish be- mula has a satisfying assignment if and only if there tween some labels from Σ. In some situations we can is a labeling with score at least M + M K where K = represent subset hints as complex hints over an alpha- 3M + 1. bet Σ 0 ⊆ Σ+ that forms a partition of Σ into equiva- The length of the annotated sequence will be n = lence classes (in other words, every two elements in Σ 0 N + 3M . The first N labels correspond to a truth as- are disjoint subsets of Σ). We will now consider a situ- signment to all variables, that is, ai = 1 corresponds ation where every hint contains only characters from Σ to yi being true. Each of the remaining labels cor- or only characters from Σ 0 (but cannot use charac- responds to one literal from the formula, namely api,j ters from both). We will show that optimal labeling 44 Marcel Kucharı́k, Jakub Kováč, Broňa Brejová (over Σ) for such hint sets can be found in polynomial is relatively well sequenced and annotated. Genomic time. sequences and reference annotations were downloaded Briefly, we first create a directed acyclic graph G from the UCSC genome browser [7]. We have used for hints over Σ similarly as in Section 2. We add new 44 MB of DNA sequence containing 5 164 genes. They hints of the form (i, i, a, 0) for all positions i and labels were split to 2 MB parts and divided to train- a ∈ Σ. Both the original hints and these new hints ing (≈ 28 MB, 3368 genes) and testing (≈ 16 MB, will be vertices. We will have an edge from hint h1 1796 genes) data sets. to hint h2 if h2 is an extension of h1 and they ei- The HMM has been taken from the work of [15] ther overlap or h2 starts immediately after h1 ends. 
and retrained on our training data set by standard In this graph, every path consists of hints that com- procedures [6]. It has 265 states, but it is quite simple pletely cover the sequence. In the same way we also compared to state-of-the-art gene finders. For exam- build a graph G0 for hints over Σ 0 . Finally, we create ple, it does not allow arbitrary length distributions of graph G00 in which each vertex is a pair (h, h0 ) such exons, introns and intergenic regions and uses simpler that h is a vertex in G, h0 is a vertex in G0 , hints h models of sequence motifs at exon boundaries. In our and h0 overlap and are compatible. If hint h0 ends be- experiments we compare the accuracy achieved on the fore h, this vertex will be connected to vertices of the testing set by the HMM alone and with different sets form (h, h00 ) where (h0 , h00 ) is an edge in G0 , the weight of hints. of the edge is also copied from G0 . Similarly, if h ends before h0 , (h, h0 ) will be connected to vertices of the form (h00 , h0 ) where (h, h00 ) is an edge in G. Finally, Protein hints if the two hints end together, they will be connected Gene 1 Gene 2 to vertices of the form (h00 , h000 ) where (h, h00 ) is an Reference edge in G and (h0 , h000 ) is an edge in G0 . The weight of Interval hints the new edge will be the sum of the two source edges. The best path in G00 then corresponds to the optimal Complex hints annotation (proof omitted due to space). No hints The algorithm can be extended to several groups of hints, each over a different partition of Σ, but the 1226000 1227000 1228000 1229000 running time would increase exponentially in the num- ber k of such groups, because the vertices in the final Fig. 3. Example of a prediction obtained with different graph will be k-tuples of vertices of graphs for individ- protein hint sets in a region containing two genes. Lines de- pict introns, rectangles exons (gene 1 has one exon, gene 2 ual groups of hints. has four exons). 
Four identical complex hints span the whole length of gene 1 and suggest short intergenic re- 4 Use of complex hints in gene finding gion beyond its boundary. Prediction with complex hints is correct. Prediction with interval hints does not agree with We have implemented a variant of the algorithm from the intergenic hint at the right end of the gene. Prediction Section 2 and applied it to the problem of gene finding. without hints predicts a long gene extending beyond the displayed region. Recall that the goal of gene finding is to find genes, regions encoding proteins in DNA sequences. A gene in eukaryotic organisms can be divided into several Experiment with protein hints. In the first experiment, segments called exons and introns, and only exons en- we have used a real set of hints originating from code the protein (see Figure 3). Thus the goal of gene a database of known proteins. We have downloaded all finding is to find the position of each gene and its ex- known proteins from several insect species (D. melano- act partitioning to exons and introns. We use output gaster, D. simulans, D. pseudoobscura, Anopheles alphabet Σ = {C, I, X}, where C stands for coding gambiae, and Bombyx mori ) from the NCBI RefSeq exons, I for introns, and X for intergenic regions. database [13] (67, 893 protein sequences in total). We The algorithm was implemented in C++ as a stan- have used the BLAT program [9] and scripts from Ex- dalone console application called grapHMM. In order onHunter distribution [2, 11] to find regions in the test- to process long DNA sequences, we implemented sev- ing set that could encode identical or similar proteins eral practical improvements. For example, the graph to those in the database. Weak matches were were is generated on the fly and unnecessary information is filtered out to increase the specificity. As a result of immediately deleted to free up memory. 
this strict filtering, only 11% of the sequence is cov- We have tested the program on genomic data from ered by hints, whereas genes cover almost 45% of the Drosophila melanogaster (fruit fly). This species is an sequence. Hints contain exon and intron labels (ex- important model organism in genetics and its genome ons are regions where the protein seems to be encoded Gene finding with complex external information 45 and introns are separating parts encoding the same Hint set No hints Point-wise Interval Complex protein). If the whole source protein could be aligned Bonus – 8 8 6 to the genome, a single intergenic position is added Gene sn. 48.83% 61.08% 61.30% 61.86% to the beginning and to the end of the complete gene Gene sp. 40.40% 46.00% 46.42% 46.61% Exon sn. 71.18% 75.89% 76.20% 76.14% hint. Exon sp. 69.01% 71.73% 72.31% 72.29% Our goal was to compare prediction accuracy ob- Base sn. 92.81% 94.35% 94.15% 94.14% tained using complex hints with accuracy obtained us- Base sp. 91.86% 92.00% 92.10% 92.08% ing simpler interval and point-wise hints. Therefore we have converted our hint set to these simpler forms by Tab. 1. Comparison of hint sets created from protein se- quences. For each set, we show the trained value of the splitting each complex hints to several shorter hints bonus parameter as well as several standard measures of as necessary. Overall we have thus used three different gene finding accuracy. Gene sensitivity (sn.) is a fraction of hint sets: real genes that were predicted completely correctly. Gene – complex hint set (15,057 training hints, 5,545 test- specificity (sp.) is the fraction of predicted genes that are ing hints) completely correct. Similarly, we measure sensitivity and – interval hint set (39,900 training hints, 14,358 test- specificity at the exon and base level, where we count com- ing hints) pletely correctly predicted exons and individual symbols in – point-wise hint set (8,651,653 training hints, coding exons. 
Bonuses of hints have a great influence on the gene prediction accuracy: if they are too low, the hints will be ignored; if they are too high, wrong hints will decrease the prediction accuracy. For simplicity, all hints in one hint set have the same bonus b, and we choose the optimal value of b to maximize the accuracy on the training set, for each set of hints separately. The exception is the complex hint set, where the bonus of a hint is b^k, where k is the number of contiguous regions in the hint labeled by one label (in other words, the number of interval hints comprising this complex hint). Again, the value of b was optimized using the training set.

The evaluation of these three hint sets on the test sequences is shown in Table 1. Complex hints slightly increase the number of completely correctly predicted genes compared to the interval hints, but the difference is less than 1%. On the other hand, the difference in accuracy between prediction without hints and prediction with any of the hint sets is quite significant.

Figure 3 shows an example of a gene where complex hints lead to a better prediction than interval hints. The algorithm with interval hints chooses an annotation agreeing with only two out of three parts comprising a single complex hint. With complex hints, only annotations agreeing with the whole hint get the bonus, and one such annotation is indeed the optimum.

Experiment with simulated hints. To better control various parameters of the hint set, we have also generated artificial hints. All hints in this set have length 1000, and the number of hints was chosen so that the sum of their lengths is approximately n/2. Half of the hints were created so that they start at a random place within some exon and agree with the reference annotation along their whole length. The second half of the hints was generated so that they start at a random place in the sequence and suggest the intergenic label along their whole length. Some of these hints are correct, since by chance they are contained in real intergenic regions, whereas others overlap real genes. Note that we have chosen intergenic hints because it is non-trivial to randomly generate other reasonable wrong hints that have a non-zero probability in the HMM. Overall, 74% of the hints in our set were correct. In this experiment, we have compared complex and interval hints, again training their bonus on the training set and testing their performance on the testing set. The results shown in Table 2 are analogous to the protein hint experiment: a slight improvement in the prediction accuracy at the gene level for complex hints compared with interval hints.

Hint set            Interval   Complex
Bonus               6          12
Gene sensitivity    60.75%     61.53%
Gene specificity    49.43%     50.41%
Exon sensitivity    77.31%     77.92%
Exon specificity    76.51%     76.92%
Base sensitivity    94.98%     94.69%
Base specificity    93.52%     93.39%

Tab. 2. Comparison of gene prediction accuracy with artificial hint sets, using the same accuracy measures as in Table 1.

5 Conclusion and open problems

In this paper, we have explored several variants of the problem of finding an optimal sequence annotation that agrees with hints with the highest total score. If the hints completely specify the annotation within some interval and all hints have positive scores, the problem can be solved quite efficiently, in time O(md + ℓ + n), where n is the length of the annotated sequence, ℓ is the total length of all hints, m is the number of hints, and d is the maximum number of hints overlapping a single position. If some hints have negative scores, the problem can still be solved in polynomial time. Finally, we can also solve the case where hints are over two different alphabets, one being a partition of the other. These algorithms can also be extended to combine hints with hidden Markov models or conditional random fields. We have also shown that if we allow wildcards in hints, the problem becomes NP-hard even for a binary output alphabet.

Our results might be applicable in various fields where HMMs and their variants need to be combined with external information. However, our original motivation stems from gene finding, where complex and subset hints are the most natural form of expressing hints from various sources. Our experiments with a simple gene finder show that, compared to the simpler interval hints used before, complex hints may lead to slight increases in prediction accuracy. More experiments with different information sources or in different species may lead to more significant improvements.

Several open problems remain in this area. Our algorithms are in the worst case quadratic in the number of hints. The question is whether more efficient algorithms exist, at least for the case of interval hints. It might also be useful to consider special classes of subset hints that can be processed in polynomial time. For example, in our earlier work the RT-PCR hints were NP-hard in general, but solvable efficiently if their relative position was constrained [10]. Finally, one could also generalize complex hints in other ways, for example, by allowing some uncertainty in the exact place where the labeling changes from one label to another.

References

1. J.E. Allen and S.L. Salzberg: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics, 21(18), 2005, 3596–3603.
2. B. Brejová, D.G. Brown, M. Li, and T. Vinař: ExonHunter: a comprehensive approach to gene finding. Bioinformatics, 21 Suppl 1, 2005, 57–65.
3. C. Burge and S. Karlin: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268(1), 1997, 78–94.
4. T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein: Introduction to algorithms, second edition. The MIT Press and McGraw-Hill Book Company, 2001.
5. D. DeCaprio, J.P. Vinson, M.D. Pearson, P. Montgomery, M. Doherty, and J.E. Galagan: Conrad: gene prediction using conditional random fields. Genome Research, 17(9), 2007, 1389–1398.
6. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison: Biological sequence analysis. Cambridge University Press, 1998.
7. P.A. Fujita, B. Rhead, A.S. Zweig, et al.: The UCSC Genome Browser database: update 2011. Nucleic Acids Research, 39(Database issue), 2011, D876–882.
8. D. Gusfield: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.
9. W.J. Kent: BLAT – the BLAST-like alignment tool. Genome Research, 12(4), 2002, 656–664.
10. J. Kováč, T. Vinař, and B. Brejová: Predicting gene structures from multiple RT-PCR tests. In Algorithms in Bioinformatics, 9th International Workshop, WABI 2009, volume 5724 of LNCS, pp. 181–193. Springer, 2009.
11. P. Kováč: Implementácia externých zdrojov dát v hľadaní génov [Implementation of external data sources in gene finding]. Bachelor Thesis, Department of Computer Science, Comenius University in Bratislava, 2010.
12. M. Kucharík: A new algorithm for using external information in gene finding. Master's Thesis, Department of Computer Science, Comenius University in Bratislava. Submitted 2011.
13. K.D. Pruitt, T. Tatusova, and D.R. Maglott: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35(Database issue), 2007, D61–65.
14. M. Stanke, M. Diekhans, R. Baertsch, and D. Haussler: Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, 24(5), 2008, 637–644.
15. R. Šrámek, B. Brejová, and T. Vinař: On-line Viterbi algorithm for analysis of long biological sequences. In Algorithms in Bioinformatics: 7th International Workshop (WABI), volume 4645 of Lecture Notes in Computer Science, pp. 240–251. Springer, 2007.
16. R.F. Yeh, L.P. Lim, and C.B. Burge: Computational inference of homologous gene structures in the human genome. Genome Research, 11(5), 2001, 803–806.