=Paper= {{Paper |id=None |storemode=property |title=Variants of Genes from the Next Generation Sequencing Data |pdfUrl=https://ceur-ws.org/Vol-1003/44.pdf |volume=Vol-1003 |dblpUrl=https://dblp.org/rec/conf/itat/KravecBBV13 }} ==Variants of Genes from the Next Generation Sequencing Data== https://ceur-ws.org/Vol-1003/44.pdf
ITAT 2013 Proceedings, CEUR Workshop Proceedings Vol. 1003, pp. 44–51
http://ceur-ws.org/Vol-1003, Series ISSN 1613-0073, c 2013 M. Kravec, M. Bobák, B. Brejová, T. Vinař



               Variants of Genes from the Next Generation Sequencing Data

                                 Martin Kravec, Martin Bobák, Broňa Brejová, and Tomáš Vinař

         Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia

Abstract: A typical next generation sequencing platform                  lem of sequence assembly is formally transformed into
produces DNA sequence data of a very fragmented nature,                  a problem of following a path in a graph. Several prac-
consisting of many short overlapping reads that need to be               tical assembly programs have been developed based on
assembled into a longer DNA sequence by an assembly                      this idea (see e.g., [Zerbino and Birney, 2008]). Since
program. There are many families of genes that evolve by                 some regions are not necessarily covered by sequencing
gene duplication. In the genomes, such genes have several                reads (sequencing gaps), and some regions are too sim-
copies that are very similar to each other, and therefore                ilar to each other in order to be distinguished by short
they pose a difficult challenge for sequence assembly pro-               reads (repetitive sequences), the resulting assembly is of-
grams. In this paper, we present a method for recovering                 ten fragmented into contigs, that cannot be further assem-
variants of genes from the NGS data without need for as-                 bled without resolving ambiguities or closing the gaps by
sembly of the whole DNA sequence. We show that our                       additional sequencing. To overcome some of these prob-
problem is NP-hard, but also demonstrate that in practice                lems, information from paired reads can be used, how-
it can be solved by integer linear programming.                          ever, a systematic use of the paired reads has been in-
                                                                         corporated into deBruijn graph framework only recently
                                                                         (see e.g., [Medvedev et al., 2011, Bankevich et al., 2012,
1    Introduction                                                        Pham et al., 2013]).
                                                                            Assuming that we already have an assembled sequence
The next generation sequencing (NGS) has brought much
                                                                         template, another problem is to align all sequencing reads
cheaper and much faster technologies for obtaining DNA
                                                                         from a given experiment to the template. Due to over-
sequences of various organisms. In the process of se-
                                                                         whelming amount of data, fast text search methods are
quencing, the technology generates many short reads (sub-
                                                                         used (e.g., FM index [Ferragina and Manzini, 2001]) to
sequences of the original DNA sequence), which are of-
                                                                         find perfect or almost perfect matches of reads to the tem-
ten sequenced as pairs with known ordering, orientation,
                                                                         plate sequence (see e.g. [Li et al., 2008a, Li et al., 2008b,
and approximate gap length in the original sequence (see
                                                                         Lin et al., 2008, Langmead et al., 2009]).
Fig.1). For example, typical reads from Illumina HiSeq
sequencing technology are approx. 100 bp long and the                       Software for all of these tasks has now become part of
length of the gap between paired reads is approx. 60 bp.                 a standard bioinformatics toolbox for processing and anal-
   The result of genome sequencing is thus a giant puz-                  ysis of the NGS data. However, a single sequencing run
zle of overlapping pieces that need to be assembled into                 can produce up to 50 GB of data [Minoche et al., 2011],
the original sequence. The most popular methods for                      and consequently time and memory requirements of these
assembling short reads are based on creating so called                   tools impose limitations on researchers working with NGS
deBruijn graphs [Pevzner et al., 2001], where the prob-                  data.
                                                                            This has motivated research into use of NGS data with-
                                                                         out assembling or aligning reads, focusing on sequence
                                                                         patterns on a local scale rather than globally. Targeted as-
                                                                         semblers use short regions as seeds to assemble short re-
                                                                         gions of interest to the researcher [Warren and Holt, 2011,
                                                                         Peterlongo and Chikhi, 2012]. [Philippe et al., 2011] have
                                                                         developed a data structure for querying large read collec-
                                                                         tions in main memory. [Zhai et al., 2012] have developed
                                                                         a framework for approximating the distribution of word
Figure 1: Illustration of genome sequencing data. The re-                occurrences in NGS data without assembly or alignment.
sult of the sequencing are many short overlapping reads                  GapFiller [Nadalin et al., 2012] utilizes paired reads to lo-
(thin lines, top) from the original sequence (green line,                cally fill the gaps in the sequence assembly.
bottom) that can be assembled into longer contigs (thick                    In this paper, we will also take this local approach and
black lines). Some regions cannot be assembled due to                    examine a problem of discovering variants of a given gene
low coverage, resulting in sequencing gaps. Some prob-                   from the NGS data. Our problem is motivated by large
lems can be solved by considering paired read information                gene families which arise by repeatedly copying a gene
(brackets), which discloses order, orientation, and the dis-             throughout the evolution. An extant genome may con-
tance between paired reads.                                              tain several copies of a gene at varying levels of similarity,
Variants of Genes from the NGS Data                                                                                           45


since each copy will mutate independently. We call such                                      TGCTTC
gene copies variants of a gene.
   Variants of a gene, especially those that are very simi-
                                                                                          CCTTGC
lar to each other, pose a large problem to NGS assembly                                        GTTCTT
tools. The individual variations are often mistaken for se-                               CCTTGG TCTTCA
quencing errors, or the copies may appear indistinguish-




                                                                          Variant Graph
able through the short reads, effectively creating an ambi-                               ACTTGCTGCTTCA
                                                                                             (reference sequence)
guity in the assembly. Consequently, variants of a gene are
often collapsed into a single sequence segment that does
not fit very well into the rest of the assembly. The input                                   C    C

to our method is an unassembled NGS data set and a gene
template. We filter the data set for reads that are similar                                       G
to the template and use a new algorithm to organize these                                              T
reads into several variants that explain the variability in the
data.
   The organization of the paper is as follows. In Section                 Figure 2: Example of a graph of variants
2, we examine the problem more closely and formulate it
as a simple graph problem. In Section 3, we show that
our formulation is NP-hard. In Section 4, we propose an           same way as non-paired reads (i.e., we create an edge for
integer linear programming (ILP) solution and in Section          each pair of alternatives covered by a pair of reads). We
5, we provide an algorithm that reduces the size of the ILP.      will assume that the gap between the paired reads is not
Finally, we demonstrate that our approach can be applied          longer than the standard read length. We will also assume
to real data and that the ILP solver (CPLEX) can be used          that the sequencing coverage is large; consequently, all
to effectively find the variants of a gene.                       possible connections up to some distance will be present
                                                                  in the data. (These assumptions indeed hold in typical se-
                                                                  quencing data sets.) This means that if three vertices u, v,
2    Problem of Finding Gene Variants                             and w represent alternatives within the same gene variant
                                                                  such that (u, w) is an edge and (v, w) is an edge, then (u, v)
In this section, we formulate the problem of finding gene         must also be an edge in the variant graph.
variants as a graph problem. The input to the problem is a           An individual gene variant can be represented as a path
sequence of a reference gene (for example, one gene from          through the graph of variants that passes through corre-
a large gene family), and the sequencing reads that align         sponding alternatives at variable positions. Now consider
to this reference sequence. While at most positions, the          all reads originating from a given gene variant. These
bases in the sequencing reads will likely agree with the ref-     reads induce a set of edges in the graph. Only some of
erence sequence, there will be some positions where some          these edges are included in the path for the variant, others
of the reads disagree with the reference sequence and with        connect more distant positions along the path. For these
each other. We will call these positions variable positions       edges, we need a notion of explained edges.
and different bases occurring at a variable position will be
called alternatives.                                              Definition 2 (Explaining edges). Path p explains edge
                                                                  (u, v) of variant graph G if both u and v lie on path p.
Definition 1 (Graph of variants). The graph of variants is
a directed acyclic graph with several groups of vertices,            In theory, every edge in the variant graph originates
each group corresponding to one variable position. Each           from some variant of the gene and thus it should be ex-
alternative at a variable position is represented by a single     plained by the path for this variant. Using Occam’s razor,
vertex in the graph. There is an edge between two vertices        we seek to explain all edges using a small number of paths.
if and only if there exists a sequencing read that contains       This leads to the following formulation of the problem of
both alternatives corresponding to these two vertices. All        finding gene variants in the genome sequencing data.
edges are directed from left to right according to the posi-
tion of the corresponding alternative.                            Problem 1 (Finding Gene Variants). Given a graph of
                                                                  variants G, find the smallest set of paths that explain all
   Figure 2 shows a simple example of the graph of vari-          edges in G.
ants. The individual reads differ from the reference be-
cause they are sequenced from different copies of that               Note that the data may give us only incomplete infor-
particular genes. The alternatives represent mutations            mation about variants of genes. If, for example, there is
and connections between these individual mutations in the         a long region within the gene without any variable posi-
reads will allow us to disentangle original gene copies.          tions, there will be no connections between alternatives on
   Our approach also works with pair-end read pairs. In the       the left and the right side of this region. The graph of vari-
variant graph construction, we approach such reads in the         ants will be split into connected components and there will
46                                                                                                                       M. Kravec, M. Bobák, B. Brejová, T. Vinař


be no path going from left to right. Yet, the above problem                                     the smallest number of paths explaining all edges in the
is well defined even in such cases, and the resulting set of                                    corresponding graph is 2z + 2.
paths gives us at least some information on partial vari-
ants of genes. However, the interpretation of such results                                      Lemma 1. If the 3CNF formula is satisfiable, there ex-
is more difficult. In the rest of the paper, we will assume                                     ist 2z + 2 paths explaining all edges in the corresponding
that the graph of variants is connected.                                                        graph.

                                                                                                Proof. Consider a satisfying assignment. For each vari-
3      Finding Gene Variants is NP-hard                                                         able x j , we will have a pair of paths, that will pass through
                                                                                                all cross gadgets corresponding to that variable in crossing
Unfortunately, the problem of finding gene variants de-                                         (in case that x j is true) or avoiding configuration (in case
fined in the previous section is NP-hard. We will prove this                                    that x j is false). These paths will explain all blue edges
by reducing 3-SAT problem to the problem of finding gene                                        except for those involving vertices u1 and u2 , as well as all
variants. Consider a formula in with clauses c1 , . . . , cn ,                                  green edges, since a particular path will consistently pass
where each clause ci consists of three literals `i,1 , `i,2 , `i,3 ,                            each cross gadget corresponding to x j in the same way.
and each literal is either a variable from a set of variables                                   Moreover, these 2z paths will also explain at least one of
Z = {x1 , . . . , xz }, or its negation.                                                        the three red edges at each level, since each clause must
    We will construct a graph representing this formula as                                      have at least one satisfied literal. The remaining edges
an instance of finding gene variants problem. An example                                        (those that pass through u1 and u2 , and at most two red
of this construction is shown in Fig.3.                                                         edges at each level corresponding to unsatisfied literals)
    For each clause ci and variable x j , we will have a                                        are easily explained using two additional paths. Therefore
gadget of five vertices vi, j,1 , . . . , vi, j,5 . These vertices                              2z + 2 paths suffice to explain all edges in the graph.
are connected by edges forming “a cross gadget” as fol-
lows: (vi, j,1 , vi, j,3 ), (vi, j,2 , vi, j,3 ), (vi, j,3 , vi, j,4 ), (vi, j,3 , vi, j,5 ).
                                                                                                Lemma 2. If the 3CNF formula is not satisfiable, explain-
There will be an additional level of these cross gadgets
                                                                                                ing all edges in the corresponding graph requires at least
that do not correspond to any clause (denote them v0, j,∗ ),
                                                                                                2z + 3 paths.
and there will be additional vertices u1 and u2 in this ze-
roth level, and vertices w1 , . . . , wn separating the levels for
                                                                                                Proof. Assume that we explain all edges by at most 2z + 2
individual clauses. Finally, there will be one designated
                                                                                                paths. In order to explain the edges at zeroth level, we
source vertex s and a sink vertex t.
                                                                                                need to have two paths that explain edges of the cross gad-
    Starting vertex s will be connected to the ver-
                                                                                                get corresponding to each variable x j at level 0; the two
tices in the leftmost level by edges (s, v0, j,1 ) and
                                                                                                additional paths must be used to explain blue edges con-
(s, v0, j,2 ) for j = 1, . . . , z, and edges (s, u1 ), (s, u2 ).
                                                                                                nected to vertices u1 and u2 and consequently they cannot
Connections between individual levels go through
                                                                                                be used to explain any green edges in the graph. Thus, we
the separating vertices wi , i.e.                                  we have edges
                                                                                                only have two paths that are usable for explaining green
(vi−1, j,4 , wi ), (vi−1, j,5 , wi ), (wi , vi, j,1 ), (wi , vi, j,2 ), and at the
                                                                                                edges connecting to all the other cross gadgets correspond-
zeroth level also (u1 , w1 ), (u2 , w1 ). Finally, gadgets at the
                                                                                                ing to variable x j . From this it follows that the two paths
last level connect to the sink vertex: (vn, j,4 ,t), (vn, j,5 ,t).
                                                                                                must pass through all cross gadgets corresponding to x j in
We will call all these edges blue edges.
                                                                                                crossing or avoiding formation, and consequently they can
    Green           edges          will        be          used         to         enforce
                                                                                                be interpreted as an assignment of the truth values to the
consistency             between              the         gadgets            (see        be-
                                                                                                variables. Since this assignment cannot satisfy the 3CNF
low).             These will connect vertices as follows:
                                                                                                formula, there must be at least one level where all three
(v0, j,1 , vi, j,1 ), (v0, j,2 , vi, j,2 ), (v0, j,4 , vi, j,4 ), (v0, j,5 , vi, j,5 ).
                                                                                                red edges remain unexplained. However, we only have
Finally, red edges will be used to encode the formula in
                                                                                                two paths (those passing through u1 and u2 ) that could be
the graph. If clause ci contains literal x j , there will be
                                                                                                used to explain those three red edges. This leads to a con-
edge (vi, j,1 , vi, j,5 ). If clause ci contains literal ¬x j , we will
                                                                                                tradiction and proves, that in such a case we require at least
have edge (vi, j,1 , vi, j,4 ).
                                                                                                2z + 3 paths.
    Intuitively, paths in the solution will encode the satis-
fying assignment (if the formula can be satisfied). If the
variable x j is true in the satisfying assignment, then we                                         Together, these two lemmas have shown that the edges
will have two paths passing through each gadget concern-                                        of a graph constructed from the 3CNF formula can be ex-
ing x j : vi, j,1 7→ vi, j,3 7→ vi, j,5 and vi, j,2 7→ vi, j,3 7→ vi, j,4 .                     plained by 2z + 2 paths if and only if the formula is satisfi-
We call this a crossing path configuration. On the other                                        able. In other words, we have shown by a reduction from
hand, if the variable x j is false, we have two paths: vi, j,1 7→                               3-SAT:
vi, j,3 7→ vi, j,4 and vi, j,2 7→ vi, j,3 7→ vi, j,5 . We call this an
avoiding path configuration. Below we will use this intu-                                       Corollary 1 (NP-hardness). The problem of finding gene
ition to prove that the formula is satisfiable if and only if                                   variants is NP-hard.
Variants of Genes from the NGS Data                                                                                                             47


                               3CNF formula:                       ( a ∨ ¬c ∨ d ) ∧ ( ¬b ∨ d ∨ ¬e ) ∧ ... ∧ ( ¬a ∨ b ∨ e )



                                            v0,1,3
                                   v0,1,1            v0,1,4
                           a       v0,1,2            v0,1,5
Variables of the formula




                           b
                                                              w1                  w2
                               s                                                                                  w3                             t
                           c                                                                                         ...
                                                                                                                       wn


                           d


                           e

                                               u1
                                               u2


Figure 3: Example of a construction of a graph of variants from the 3CNF formula. Blue edges (regular width) form
and connect individual gadgets, while green edges (thin) enforce consistency between individual gadgets, and red edges
(thick) encode the formula.


4                          Integer Linear Programming
                           Formulation
                                                                                       ∀i : ∀v ∈ V −Vs : zi,v =      ∑ xi,(u,v)                (1)
                                                                                                                  (u,v)∈E

                                                                                       ∀i : ∀v ∈ V −Vt : zi,v =      ∑ xi,(v,w)                (2)
                                                                                                                  (v,w)∈E
In the previous section, we have shown that Problem 1                                  ∀i : ∑ zi,v ≤ 1                                         (3)
is NP-hard. Here, we propose that the problem can be                                       v∈Vs
effectively solved by an integer linear programming (ILP)                              ∀i : ∑ zi,v ≤ 1                                         (4)
solver.                                                                                    v∈Vt

   For a given graph of variants G(V, E), denote Vs ⊆ V set                                                         1
                                                                                       ∀i : ∀(u, v) ∈ E : yi,(u,v) ≤ (zi,u + zi,v )            (5)
of vertices without input edges (source vertices) and Vt ⊆                                                          2
                                                                                                       k
V set of vertices without output edges (sink vertices). We
                                                                                       ∀(u, v) ∈ E : ∑ yi,(u,v) ≥ 1                            (6)
will assume that the minimum number of paths explaining
                                                                                                     i=1
all edges E is smaller than some number k; this can be set
                                                                                       ∀i : ∑ idv zi,v ≥ ∑ idv zi+1,v                          (7)
for example as the number of edges in G, but in practice
                                                                                           v∈V             v∈V
much smaller values of k will suffice.
                                                                                       ∀i : ∀(u, v) ∈ E : xi,(u,v) ∈ {0, 1}                    (8)
   We will have three types of 0-1 variables, each set of                              ∀i : ∀(u, v) ∈ E : yi,(u,v) ∈ {0, 1}                    (9)
variables replicated for each possible path. For path num-
ber i (1 ≤ i ≤ k), variable xi,e = 1 if edge e is included                             ∀i : ∀v ∈ V : zi,v ∈ {0, 1}                           (10)
in the path; variable yi,e = 1 if edge e is explained by the
                                                                                          where idv is a unique identifier of vertex v. Conditions
path; variable zi,v = 1 if vertex v is included in the path.
                                                                                       (1)-(4) ensure consistency between assignments of X and
   Our goal is to minimize the number of paths, which can                              Z variables and at the same time, they enforce that the ob-
be, for example, computed as the number of source ver-                                 ject represented by the i-th group of variables is indeed a
tices used in the solution ∑ki=1 ∑v∈Vs zi,v . This sum is min-                         path. Condition (5) determines explained edges and con-
imized subject to the following conditions:                                            dition (6) assures that each edge is explained by at least
48                                                                                                                                                M. Kravec, M. Bobák, B. Brejová, T. Vinař


one path. Finally, condition (7) helps to break symme-
                                                                                              A na: 802   A na: 818                                                                    A na: 802   A na: 818
                                                                                                 555         415                                                                          555         415




tries leading to many equivalent solutions (problems hav-
ing many symmetric solutions may be difficult to solve).              C na: 765
                                                                        2682
                                                                                  C na: 780
                                                                                    2664
                                                                                                                          C na: 828
                                                                                                                             779
                                                                                                                                      C na: 834
                                                                                                                                        4314
                                                                                                                                                  C na: 876
                                                                                                                                                    4661
                                                                                                                                                              C na: 765
                                                                                                                                                                2682
                                                                                                                                                                           C na: 780
                                                                                                                                                                             2664
                                                                                                                                                                                                               C na: 828
                                                                                                                                                                                                                  779
                                                                                                                                                                                                                           C na: 834
                                                                                                                                                                                                                             4314
                                                                                                                                                                                                                                        C na: 876
                                                                                                                                                                                                                                          4661




   This ILP has O((m + n)k) variables and conditions.
Variables xi,e fully determine the solution; the other vari-
ables are auxiliary and can be uniquely determined from                                       G na: 802
                                                                                                4846
                                                                                                          G na: 818
                                                                                                            4871
                                                                                                                                                                                       G na: 802
                                                                                                                                                                                         4846
                                                                                                                                                                                                   G na: 818
                                                                                                                                                                                                     4871




xi,e . In the next section, we show how to reduce the num-
ber of determining variables which will increase the effi-            T na: 765   T na: 780                               T na: 828   T na: 834   T na: 876   T na: 765    T na: 780                           T na: 828   T na: 834    T na: 876




ciency of the ILP in practice.
                                                                        2794        2971                                    4296         761         386        2794         2971                                4296         761          386




5 Reduced Graph of Variants                                           Figure 4: Example graph of variants (left) and reduced
                                                                      graph of variants (right).
In this section, we introduce an algorithm that will reduce
the number of edges in the graph of variants. In particular,          for edges (u, v) ∈ Er , because the remaining edges will not
we will only be interested in the subset of edges that are            be used by the optimal set of paths. We will change the
needed for the optimal solution.                                      conditions as follows:
Definition 3 (Reduced graph of variants). Let                         Eq. (1)                       ∀i : ∀v ∈ V −Vs : zi,v =                                              ∑ xi,(u,v)                                                   (11)
{p1 , p2 , . . . , pk } be the optimal set of paths explain-                                                                                                    (u,v)∈Er
ing all edges in the graph of variants. Reduced graph of
variants is a union of all edges of paths p1 , . . . , pk . In case
                                                                      Eq. (2)                       ∀i : ∀v ∈ V −Vt : zi,v =                                              ∑ xi,(v,w)                                                   (12)
                                                                                                                                                                (v,w)∈Er
of multiple optimal solutions we choose the smallest such
set of edges.                                                         Eq. (8)                       ∀i : ∀(u, v) ∈ Er : xi,(u,v) ∈ {0, 1}                                                                                              (13)

   Although computing the optimal set of paths is hard,               We also add one more condition stating that each edge
union of edges on these paths can be found by a simple                from Er must be used on some path:
algorithm. A detour for an edge (vi , v j ) is a path of length                                                       k
at least two connecting vi and v j . If an edge e does not have       ∀(u, v) ∈ Er : ∑ xi,(u,v) ≥ 1                                                                                                                                    (14)
a detour, it can be explained only by paths contaning e, and                                                     i=1
thus it must be included in the reduced graph. Conversely,
if e has a detour, any path in the optimal set can use edges          6 Experiments and Results
in the detour instead of e. Any optimal set of paths can be
thus transformed to contain only edges without a detour               We tested our approach on Illumina sequencing data from
and these edges form the reduced graph.                               yeast species Magnusiomyces magnusii (unpublished).
   To find the reduced graph, we number the vertices of the           For out tests, we have selected FDH (formate dehydroge-
variant graph v1 , . . . vn from left to right. Let Mi be the set     nase) gene that exhibits variation in the number of copies
of vertices, from which vertex vi is reachable. Algorithm             in related species (for example, two copies in S. cere-
1 computes sets Mi for all vertices vi and simultaneously             visiae, three copies in C. albicans, and ten copies in Y.
selects edges for the reduced graph of variants.                      lipolytica). The sequence assembly by assembly program
   For each vertex vi , we compute set Mi by exploring in-            Velvet [Zerbino and Birney, 2008] contained several frag-
coming edges (v j , vi ) in the descending order by v j . If v j      ments of FDH-like genes (likely originating from differ-
is not already in Mi , then we add all vertices from M j to           ent gene copies), but all of them were apparently incom-
Mi . In addition, there is no detour for (v j , vi ), because such    plete. We have manually assembled some of these frag-
a detour would have to enter vi by and edge (vk , vi ) with           ments and created a putative 1150 bp long template of
k > j. Since set Mk for such k contains v j , v j would be            the FDH gene. By a rather complicated process using
already in Mi when we encounter edge (v j , vi ), which is            blastx [Altschul et al., 1997], blat [Kent, 2002], and Vel-
a contradiction. Therefore we select edge (v j , vi ) for the         vet [Zerbino and Birney, 2008], we have selected 74 694
reduced graph. On the other hand, if v j is already in Mi , it        read pairs similar to the FDH template. The average length
can be easily seen that a detour must exist for (v j , vi ).          of a read is 100.8 bp, and the expected length of the gap
   The running time of this algorithm is O(n3 ) due set ma-           between the paired reads is 60.
nipulation on line 8. Instead of this simple algorithm, we                The individual reads were then aligned to the FDH tem-
could also use faster transitive closure computation based            plate by blat. Based on read alignments, we have iden-
on fast matrix multiplication, but this algorithm is suffi-           tified all variable positions and alternatives. To avoid se-
cient for our purposes. Example of a reduced graph of                 quencing errors, we have discarded all alternatives that ac-
variants can be seen in Figure 4.                                     counted for less than 10% of reads at each variable posi-
   Given a reduced graph of variants Gr (Vr , Er ), we can            tion. The ILP was solved by CPLEX optimization library,
modify the ILP program by including variables xi,(u,v) only           the maximum number of paths was set to 30.
Variants of Genes from the NGS Data                                                                                             49


Algorithm 1 Algorithm for reducing the variant graph
 1: Input: Variant graph G(V,E)                                                       . vertices of V are in topological order
 2: selectedEdges = {}
 3: for i = 1 . . . n do
 4:     Mi = {vi }
 5:     for j = i − 1 . . . 1 ; (v j , vi ) ∈ E do                    . iterate through incoming edges in descending order
 6:         if v j 6∈ Mi then
 7:               Mi = Mi ∪ M j
 8:               selectedEdges = selectedEdges ∪{(v j , vi )}
 9:         end if
10:     end for
11: end for
12: return selectedEdges



   Figure 5 shows the result of our experiment. The num-         gene and may help to pinpoint the correct number of vari-
ber of edges in the graph of variants can be significantly       ants.
reduced, thus speeding up the ILP. The algorithm has dis-
covered six variants of FDH gene in M. magnusii genome.          Acknowledgements. We thank prof. Jozef Nosek for ac-
Our experiments shows that even though, in general, the          cess to the unpublished sequencing data and for suggesting
problem is NP-hard, it is possible to apply the ILP on the       FDH gene as an interesting target for study. This research
real data.                                                       was supported by VEGA grant 1/1085/12 and by Come-
                                                                 nius University young researcher grant UK/425/2013.

7    Conclusion and Future Work
                                                                 References
In this paper, we have introduced a problem of recovering
all variants of a given gene from unassembled next gen-          [Altschul et al., 1997] Altschul, S. F., Madden, T. L., Schaffer,
                                                                    A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J.
eration sequencing data. We have formulated the problem
                                                                    (1997). Gapped BLAST and PSI-BLAST: a new genera-
as a graph problem, shown that the problem is NP-hard,
                                                                    tion of protein database search programs. Nucleic Acids Res,
and proposed a practical solution with the integer linear           25(17):3389–3392.
program. We have shown that our solution can indeed be
                                                                 [Bankevich et al., 2012] Bankevich, A., Nurk, S., Antipov, D.,
applied to real data, and we have detected six variants of          Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M.,
the FDH gene in M. magnusii genome.                                 Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V.,
   There are several problems open for future work. In              Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A., and
this work, we have used pair-end reads, i.e. pairs of short         Pevzner, P. A. (2012). SPAdes: a new genome assembly algo-
reads that are separated by a short gap (in this work, we           rithm and its applications to single-cell sequencing. J Comput
have assumed that the gap between the two paired reads              Biol, 19(5):455–457.
is shorter than the read length). On the other hand, mate-       [Ferragina and Manzini, 2001] Ferragina, P. and Manzini, G.
pair sequencing produces paired reads that are at a much            (2001). An experimental study of an opportunistic index. In
longer distance. How do we incorporate such reads into              Proceedings of the Twelfth Annual ACM-SIAM Symposium on
our framework?                                                      Discrete algorithms (SODA), pages 269–278.
   Here, we have considered variants where only a sin-           [Kent, 2002] Kent, W. J. (2002). BLAT–the BLAST-like align-
                                                                    ment tool. Genome Res, 12(4):656–664.
gle base has changed at a particular position. However,
other variants, such as short insertions or deletions, or even   [Langmead et al., 2009] Langmead, B., Trapnell, C., Pop, M.,
                                                                    and Salzberg, S. L. (2009). Ultrafast and memory-efficient
larger scale changes, do occur in real data and they need
                                                                    alignment of short DNA sequences to the human genome.
to be accounted for.
                                                                    Genome Biol, 10(3):R25.
   One information that we ignored is the number of reads
                                                                 [Li et al., 2008a] Li, H., Ruan, J., and Durbin, R. (2008a). Map-
overlapping each position of the template sequence. This            ping short DNA sequencing reads and calling variants using
number is correlated with the number of copies of a par-            mapping quality scores. Genome Res, 18(11):1851–1858.
ticular segment of the sequence. Such correlations can           [Li et al., 2008b] Li, R., Li, Y., Kristiansen, K., and Wang, J.
be used to help to decide between alternative solutions             (2008b). SOAP: short oligonucleotide alignment program.
(each variant should have the same coverage over its whole          Bioinformatics, 24(5):713–714.
length).                                                         [Lin et al., 2008] Lin, H., Zhang, Z., Zhang, M. Q., Ma, B., and
   Finally, sequences flanking the template sequence on             Li, M. (2008). ZOOM! Zillions of oligos mapped. Bioinfor-
each side will also differ between individual copies of a           matics, 24(21):2431–2437.
50                                                                                                                                                                        M. Kravec, M. Bobák, B. Brejová, T. Vinař




              A na: 105               A na: 232                                       A na: 573                                       A na: 802   A na: 818                           A na: 871
                3076                    1586                                            3016                                             555         415                                3312




                          C na: 231               C na: 240   C na: 396   C na: 399               C na: 756   C na: 765   C na: 780                           C na: 828   C na: 834               C na: 876   C na: 966
                            1563                    2135        3196        3159                    2396        2682        2664                                 779        4314                    4661        1805




              G na: 105               G na: 232                                       G na: 573                                       G na: 802   G na: 818                           G na: 871
                2142                    4769                                            1727                                            4846        4871                                1698




                          T na: 231               T na: 240   T na: 396   T na: 399               T na: 756   T na: 765   T na: 780                           T na: 828   T na: 834               T na: 876   T na: 966
                            4776                    4360        1901        1910                    3161        2794        2971                                4296         761                     386        2105




              A na: 105               A na: 232                                       A na: 573                                       A na: 802   A na: 818                           A na: 871
                3076                    1586                                            3016                                             555         415                                3312




                          C na: 231               C na: 240   C na: 396   C na: 399               C na: 756   C na: 765   C na: 780                           C na: 828   C na: 834               C na: 876   C na: 966
                            1563                    2135        3196        3159                    2396        2682        2664                                 779        4314                    4661        1805




              G na: 105               G na: 232                                       G na: 573                                       G na: 802   G na: 818                           G na: 871
                2142                    4769                                            1727                                            4846        4871                                1698




                          T na: 231               T na: 240   T na: 396   T na: 399               T na: 756   T na: 765   T na: 780                           T na: 828   T na: 834               T na: 876   T na: 966
                            4776                    4360        1901        1910                    3161        2794        2971                                4296         761                     386        2105




              A na: 105               A na: 232                                       A na: 573                                       A na: 802   A na: 818                           A na: 871
                3076                    1586                                            3016                                             555         415                                3312




                          C na: 231               C na: 240   C na: 396   C na: 399               C na: 756   C na: 765   C na: 780                           C na: 828   C na: 834               C na: 876   C na: 966
                            1563                    2135        3196        3159                    2396        2682        2664                                 779        4314                    4661        1805




              G na: 105               G na: 232                                       G na: 573                                       G na: 802   G na: 818                           G na: 871
                2142                    4769                                            1727                                            4846        4871                                1698




                          T na: 231               T na: 240   T na: 396   T na: 399               T na: 756   T na: 765   T na: 780                           T na: 828   T na: 834               T na: 876   T na: 966
                            4776                    4360        1901        1910                    3161        2794        2971                                4296         761                     386        2105




Figure 5: Variants of FDH gene in M. magnusii genome. (top) The full graph of variants; only alternatives accounting for
more than 10% of aligned genes were included. (middle) Reduced graph of variants significantly reduces the number of
edges. (bottom) Six variants discovered by the ILP; each variant is shown in a different color.
Variants of Genes from the NGS Data                                 51


[Medvedev et al., 2011] Medvedev, P., Pham, S., Chaisson, M.,
   Tesler, G., and Pevzner, P. (2011). Paired de bruijn graphs: a
   novel approach for incorporating mate pair information into
   genome assemblers. J Comput Biol, 18(11):1625–1634.
[Minoche et al., 2011] Minoche, A. E., Dohm, J. C., and Him-
   melbauer, H. (2011). Evaluation of genomic high-throughput
   sequencing data generated on Illumina HiSeq and genome an-
   alyzer systems. Genome Biol, 12(11):R112.
[Nadalin et al., 2012] Nadalin, F., Vezzi, F., and Policriti, A.
   (2012). GapFiller: a de novo assembly approach to fill the gap
   within paired reads. BMC Bioinformatics, 13 Suppl 14:S8.
[Peterlongo and Chikhi, 2012] Peterlongo, P. and Chikhi, R.
   (2012). Mapsembler, targeted and micro assembly of large
   NGS datasets on a desktop computer. BMC Bioinformatics,
   13:48.
[Pevzner et al., 2001] Pevzner, P. A., Tang, H., and Waterman,
   M. S. (2001). An Eulerian path approach to DNA fragment
   assembly. Proc Natl Acad Sci U S A, 98(17):9748–9753.
[Pham et al., 2013] Pham, S. K., Antipov, D., Sirotkin, A.,
   Tesler, G., Pevzner, P. A., and Alekseyev, M. A. (2013). Path-
   set graphs: a novel approach for comprehensive utilization of
   paired reads in genome assembly. J Comput Biol, 20(4):359–
   361.
[Philippe et al., 2011] Philippe, N., Salson, M., Lecroq, T.,
   Leonard, M., Commes, T., and Rivals, E. (2011). Querying
   large read collections in main memory: a versatile data struc-
   ture. BMC Bioinformatics, 12:242.
[Warren and Holt, 2011] Warren, R. L. and Holt, R. A. (2011).
   Targeted assembly of short sequence reads. PLoS One,
   6(5):e19816.
[Zerbino and Birney, 2008] Zerbino, D. R. and Birney, E.
   (2008). Velvet: algorithms for de novo short read assembly
   using de Bruijn graphs. Genome Res, 18(5):821–829.
[Zhai et al., 2012] Zhai, Z., Reinert, G., Song, K., Waterman,
   M. S., Luan, Y., and Sun, F. (2012). Normal and compound
   poisson approximations for pattern occurrences in NGS reads.
   J Comput Biol, 19(6):839–844.