<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Variants of Genes from the Next Generation Sequencing Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Kravec</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Bobák</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronˇ a Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Vinarˇ</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics, and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynská dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1003</volume>
      <fpage>44</fpage>
      <lpage>51</lpage>
      <abstract>
        <p>A typical next generation sequencing platform produces DNA sequence data of a very fragmented nature, consisting of many short overlapping reads that need to be assembled into a longer DNA sequence by an assembly program. There are many families of genes that evolve by gene duplication. In the genomes, such genes have several copies that are very similar to each other, and therefore they pose a difficult challenge for sequence assembly programs. In this paper, we present a method for recovering variants of genes from the NGS data without need for assembly of the whole DNA sequence. We show that our problem is NP-hard, but also demonstrate that in practice it can be solved by integer linear programming. The next generation sequencing (NGS) has brought much cheaper and much faster technologies for obtaining DNA sequences of various organisms. In the process of sequencing, the technology generates many short reads (subsequences of the original DNA sequence), which are often sequenced as pairs with known ordering, orientation, and approximate gap length in the original sequence (see Fig.1). For example, typical reads from Illumina HiSeq sequencing technology are approx. 100 bp long and the length of the gap between paired reads is approx. 60 bp. The result of genome sequencing is thus a giant puzzle of overlapping pieces that need to be assembled into the original sequence. The most popular methods for assembling short reads are based on creating so called deBruijn graphs [Pevzner et al., 2001], where the prob-</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        lem of sequence assembly is formally transformed into
a problem of following a path in a graph. Several
practical assembly programs have been developed based on
this idea
        <xref ref-type="bibr" rid="ref17">(see e.g., [Zerbino and Birney, 2008])</xref>
        . Since
some regions are not necessarily covered by sequencing
reads (sequencing gaps), and some regions are too
similar to each other in order to be distinguished by short
reads (repetitive sequences), the resulting assembly is
often fragmented into contigs, that cannot be further
assembled without resolving ambiguities or closing the gaps by
additional sequencing. To overcome some of these
problems, information from paired reads can be used,
however, a systematic use of the paired reads has been
incorporated into deBruijn graph framework only recently
        <xref ref-type="bibr" rid="ref10 ref11 ref14 ref15 ref18 ref2 ref9">(see e.g., [Medvedev et al., 2011, Bankevich et al., 2012,
Pham et al., 2013])</xref>
        .
      </p>
      <p>
        Assuming that we already have an assembled sequence
template, another problem is to align all sequencing reads
from a given experiment to the template. Due to
overwhelming amount of data, fast text search methods are
used
        <xref ref-type="bibr" rid="ref3">(e.g., FM index [Ferragina and Manzini, 2001])</xref>
        to
find perfect or almost perfect matches of reads to the
template sequence
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">(see e.g. [Li et al., 2008a, Li et al., 2008b,
Lin et al., 2008, Langmead et al., 2009])</xref>
        .
      </p>
      <p>Software for all of these tasks has now become part of
a standard bioinformatics toolbox for processing and
analysis of the NGS data. However, a single sequencing run
can produce up to 50 GB of data [Minoche et al., 2011],
and consequently time and memory requirements of these
tools impose limitations on researchers working with NGS
data.</p>
      <p>This has motivated research into use of NGS data
without assembling or aligning reads, focusing on sequence
patterns on a local scale rather than globally. Targeted
assemblers use short regions as seeds to assemble short
regions of interest to the researcher [Warren and Holt, 2011,
Peterlongo and Chikhi, 2012]. [Philippe et al., 2011] have
developed a data structure for querying large read
collections in main memory. [Zhai et al., 2012] have developed
a framework for approximating the distribution of word
occurrences in NGS data without assembly or alignment.
GapFiller [Nadalin et al., 2012] utilizes paired reads to
locally fill the gaps in the sequence assembly.</p>
      <p>In this paper, we will also take this local approach and
examine a problem of discovering variants of a given gene
from the NGS data. Our problem is motivated by large
gene families which arise by repeatedly copying a gene
throughout the evolution. An extant genome may
contain several copies of a gene at varying levels of similarity,
since each copy will mutate independently. We call such
gene copies variants of a gene.</p>
      <p>Variants of a gene, especially those that are very
similar to each other, pose a large problem to NGS assembly
tools. The individual variations are often mistaken for
sequencing errors, or the copies may appear
indistinguishable through the short reads, effectively creating an
ambiguity in the assembly. Consequently, variants of a gene are
often collapsed into a single sequence segment that does
not fit very well into the rest of the assembly. The input
to our method is an unassembled NGS data set and a gene
template. We filter the data set for reads that are similar
to the template and use a new algorithm to organize these
reads into several variants that explain the variability in the
data.</p>
      <p>The organization of the paper is as follows. In Section
2, we examine the problem more closely and formulate it
as a simple graph problem. In Section 3, we show that
our formulation is NP-hard. In Section 4, we propose an
integer linear programming (ILP) solution and in Section
5, we provide an algorithm that reduces the size of the ILP.
Finally, we demonstrate that our approach can be applied
to real data and that the ILP solver (CPLEX) can be used
to effectively find the variants of a gene.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Problem of Finding Gene Variants</title>
      <p>In this section, we formulate the problem of finding gene
variants as a graph problem. The input to the problem is a
sequence of a reference gene (for example, one gene from
a large gene family), and the sequencing reads that align
to this reference sequence. While at most positions, the
bases in the sequencing reads will likely agree with the
reference sequence, there will be some positions where some
of the reads disagree with the reference sequence and with
each other. We will call these positions variable positions
and different bases occurring at a variable position will be
called alternatives.</p>
      <p>Definition 1 (Graph of variants). The graph of variants is
a directed acyclic graph with several groups of vertices,
each group corresponding to one variable position. Each
alternative at a variable position is represented by a single
vertex in the graph. There is an edge between two vertices
if and only if there exists a sequencing read that contains
both alternatives corresponding to these two vertices. All
edges are directed from left to right according to the
position of the corresponding alternative.</p>
      <p>Figure 2 shows a simple example of the graph of
variants. The individual reads differ from the reference
because they are sequenced from different copies of that
particular genes. The alternatives represent mutations
and connections between these individual mutations in the
reads will allow us to disentangle original gene copies.</p>
      <p>Our approach also works with pair-end read pairs. In the
variant graph construction, we approach such reads in the</p>
      <sec id="sec-2-1">
        <title>TGCTTC CCTTGC GTTCTT CCTTGG TCTTCA</title>
      </sec>
      <sec id="sec-2-2">
        <title>ACTTGCTGCTTCA</title>
        <p>(reference sequence)
C</p>
        <p>C
G</p>
        <p>T
same way as non-paired reads (i.e., we create an edge for
each pair of alternatives covered by a pair of reads). We
will assume that the gap between the paired reads is not
longer than the standard read length. We will also assume
that the sequencing coverage is large; consequently, all
possible connections up to some distance will be present
in the data. (These assumptions indeed hold in typical
sequencing data sets.) This means that if three vertices u, v,
and w represent alternatives within the same gene variant
such that (u, w) is an edge and (v, w) is an edge, then (u, v)
must also be an edge in the variant graph.</p>
        <p>An individual gene variant can be represented as a path
through the graph of variants that passes through
corresponding alternatives at variable positions. Now consider
all reads originating from a given gene variant. These
reads induce a set of edges in the graph. Only some of
these edges are included in the path for the variant, others
connect more distant positions along the path. For these
edges, we need a notion of explained edges.</p>
        <p>Definition 2 (Explaining edges). Path p explains edge
(u, v) of variant graph G if both u and v lie on path p.</p>
        <p>In theory, every edge in the variant graph originates
from some variant of the gene and thus it should be
explained by the path for this variant. Using Occam’s razor,
we seek to explain all edges using a small number of paths.
This leads to the following formulation of the problem of
finding gene variants in the genome sequencing data.
Problem 1 (Finding Gene Variants). Given a graph of
variants G, find the smallest set of paths that explain all
edges in G.</p>
        <p>Note that the data may give us only incomplete
information about variants of genes. If, for example, there is
a long region within the gene without any variable
positions, there will be no connections between alternatives on
the left and the right side of this region. The graph of
variants will be split into connected components and there will
be no path going from left to right. Yet, the above problem
is well defined even in such cases, and the resulting set of
paths gives us at least some information on partial
variants of genes. However, the interpretation of such results
is more difficult. In the rest of the paper, we will assume
that the graph of variants is connected.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Finding Gene Variants is NP-hard</title>
      <p>Unfortunately, the problem of finding gene variants
defined in the previous section is NP-hard. We will prove this
by reducing 3-SAT problem to the problem of finding gene
variants. Consider a formula in with clauses c1, . . . , cn,
where each clause ci consists of three literals `i,1, `i,2, `i,3,
and each literal is either a variable from a set of variables
Z = {x1, . . . , xz}, or its negation.</p>
      <p>We will construct a graph representing this formula as
an instance of finding gene variants problem. An example
of this construction is shown in Fig.3.</p>
      <p>For each clause ci and variable x j, we will have a
gadget of five vertices vi, j,1, . . . , vi, j,5. These vertices
are connected by edges forming “a cross gadget” as
follows: (vi, j,1, vi, j,3), (vi, j,2, vi, j,3), (vi, j,3, vi, j,4), (vi, j,3, vi, j,5).
There will be an additional level of these cross gadgets
that do not correspond to any clause (denote them v0, j,∗),
and there will be additional vertices u1 and u2 in this
zeroth level, and vertices w1, . . . , wn separating the levels for
individual clauses. Finally, there will be one designated
source vertex s and a sink vertex t.</p>
      <p>Starting vertex s will be connected to the
vertices in the leftmost level by edges (s, v0, j,1) and
(s, v0, j,2) for j = 1, . . . , z, and edges (s, u1), (s, u2).
Connections between individual levels go through
the separating vertices wi, i.e. we have edges
(vi−1, j,4, wi), (vi−1, j,5, wi), (wi, vi, j,1), (wi, vi, j,2), and at the
zeroth level also (u1, w1), (u2, w1). Finally, gadgets at the
last level connect to the sink vertex: (vn, j,4, t), (vn, j,5, t).
We will call all these edges blue edges.</p>
      <p>Green edges will be used to enforce
consistency between the gadgets (see
below). These will connect vertices as follows:
(v0, j,1, vi, j,1), (v0, j,2, vi, j,2), (v0, j,4, vi, j,4), (v0, j,5, vi, j,5).
Finally, red edges will be used to encode the formula in
the graph. If clause ci contains literal x j, there will be
edge (vi, j,1, vi, j,5). If clause ci contains literal ¬x j, we will
have edge (vi, j,1, vi, j,4).</p>
      <p>Intuitively, paths in the solution will encode the
satisfying assignment (if the formula can be satisfied). If the
variable x j is true in the satisfying assignment, then we
will have two paths passing through each gadget
concerning x j: vi, j,1 7→ vi, j,3 7→ vi, j,5 and vi, j,2 7→ vi, j,3 7→ vi, j,4.
We call this a crossing path configuration. On the other
hand, if the variable x j is false, we have two paths: vi, j,1 7→
vi, j,3 7→ vi, j,4 and vi, j,2 7→ vi, j,3 7→ vi, j,5. We call this an
avoiding path configuration. Below we will use this
intuition to prove that the formula is satisfiable if and only if
the smallest number of paths explaining all edges in the
corresponding graph is 2z + 2.</p>
      <p>Lemma 1. If the 3CNF formula is satisfiable, there
exist 2z + 2 paths explaining all edges in the corresponding
graph.</p>
      <p>Proof. Consider a satisfying assignment. For each
variable x j, we will have a pair of paths, that will pass through
all cross gadgets corresponding to that variable in crossing
(in case that x j is true) or avoiding configuration (in case
that x j is false). These paths will explain all blue edges
except for those involving vertices u1 and u2, as well as all
green edges, since a particular path will consistently pass
each cross gadget corresponding to x j in the same way.
Moreover, these 2z paths will also explain at least one of
the three red edges at each level, since each clause must
have at least one satisfied literal. The remaining edges
(those that pass through u1 and u2, and at most two red
edges at each level corresponding to unsatisfied literals)
are easily explained using two additional paths. Therefore
2z + 2 paths suffice to explain all edges in the graph.
Lemma 2. If the 3CNF formula is not satisfiable,
explaining all edges in the corresponding graph requires at least
2z + 3 paths.</p>
      <p>Proof. Assume that we explain all edges by at most 2z + 2
paths. In order to explain the edges at zeroth level, we
need to have two paths that explain edges of the cross
gadget corresponding to each variable x j at level 0; the two
additional paths must be used to explain blue edges
connected to vertices u1 and u2 and consequently they cannot
be used to explain any green edges in the graph. Thus, we
only have two paths that are usable for explaining green
edges connecting to all the other cross gadgets
corresponding to variable x j. From this it follows that the two paths
must pass through all cross gadgets corresponding to x j in
crossing or avoiding formation, and consequently they can
be interpreted as an assignment of the truth values to the
variables. Since this assignment cannot satisfy the 3CNF
formula, there must be at least one level where all three
red edges remain unexplained. However, we only have
two paths (those passing through u1 and u2) that could be
used to explain those three red edges. This leads to a
contradiction and proves, that in such a case we require at least
2z + 3 paths.</p>
      <p>Together, these two lemmas have shown that the edges
of a graph constructed from the 3CNF formula can be
explained by 2z + 2 paths if and only if the formula is
satisfiable. In other words, we have shown by a reduction from
3-SAT:
Corollary 1 (NP-hardness). The problem of finding gene
variants is NP-hard.</p>
      <p>a
a
l
u
r b
m
o
f
e
h
t
oc
f
s
e
l
b
a
i
a d
r
V
e
s
3CNF formula:
( a ∨
¬c ∨
d ) ∧ ( ¬b ∨
d ∨
¬e ) ∧ ... ∧ ( ¬a ∨
b ∨ e )
In the previous section, we have shown that Problem 1
is NP-hard. Here, we propose that the problem can be
effectively solved by an integer linear programming (ILP)
solver.</p>
      <p>For a given graph of variants G(V, E), denote Vs ⊆ V set
of vertices without input edges (source vertices) and Vt ⊆
V set of vertices without output edges (sink vertices). We
will assume that the minimum number of paths explaining
all edges E is smaller than some number k; this can be set
for example as the number of edges in G, but in practice
much smaller values of k will suffice.</p>
      <p>We will have three types of 0-1 variables, each set of
variables replicated for each possible path. For path
number i (1 ≤ i ≤ k), variable xi,e = 1 if edge e is included
in the path; variable yi,e = 1 if edge e is explained by the
path; variable zi,v = 1 if vertex v is included in the path.</p>
      <p>Our goal is to minimize the number of paths, which can
be, for example, computed as the number of source
verk
tices used in the solution ∑i=1 ∑v∈Vs zi,v. This sum is
minimized subject to the following conditions:</p>
      <p>where idv is a unique identifier of vertex v. Conditions
(1)-(4) ensure consistency between assignments of X and
Z variables and at the same time, they enforce that the
object represented by the i-th group of variables is indeed a
path. Condition (5) determines explained edges and
condition (6) assures that each edge is explained by at least
one path. Finally, condition (7) helps to break
symmetries leading to many equivalent solutions (problems
having many symmetric solutions may be difficult to solve).</p>
      <p>This ILP has O((m + n)k) variables and conditions.
Variables xi,e fully determine the solution; the other
variables are auxiliary and can be uniquely determined from
xi,e. In the next section, we show how to reduce the
number of determining variables which will increase the
efficiency of the ILP in practice.
5</p>
    </sec>
    <sec id="sec-4">
      <title>Reduced Graph of Variants</title>
      <p>In this section, we introduce an algorithm that will reduce
the number of edges in the graph of variants. In particular,
we will only be interested in the subset of edges that are
needed for the optimal solution.</p>
      <p>Definition 3 (Reduced graph of variants). Let
{p1, p2, . . . , pk} be the optimal set of paths
explaining all edges in the graph of variants. Reduced graph of
variants is a union of all edges of paths p1, . . . , pk. In case
of multiple optimal solutions we choose the smallest such
set of edges.</p>
      <p>Although computing the optimal set of paths is hard,
union of edges on these paths can be found by a simple
algorithm. A detour for an edge (vi, v j) is a path of length
at least two connecting vi and v j. If an edge e does not have
a detour, it can be explained only by paths contaning e, and
thus it must be included in the reduced graph. Conversely,
if e has a detour, any path in the optimal set can use edges
in the detour instead of e. Any optimal set of paths can be
thus transformed to contain only edges without a detour
and these edges form the reduced graph.</p>
      <p>To find the reduced graph, we number the vertices of the
variant graph v1, . . . vn from left to right. Let Mi be the set
of vertices, from which vertex vi is reachable. Algorithm
1 computes sets Mi for all vertices vi and simultaneously
selects edges for the reduced graph of variants.</p>
      <p>For each vertex vi, we compute set Mi by exploring
incoming edges (v j, vi) in the descending order by v j. If v j
is not already in Mi, then we add all vertices from M j to
Mi. In addition, there is no detour for (v j, vi), because such
a detour would have to enter vi by and edge (vk, vi) with
k &gt; j. Since set Mk for such k contains v j, v j would be
already in Mi when we encounter edge (v j, vi), which is
a contradiction. Therefore we select edge (v j, vi) for the
reduced graph. On the other hand, if v j is already in Mi, it
can be easily seen that a detour must exist for (v j, vi).</p>
      <p>The running time of this algorithm is O(n3) due set
manipulation on line 8. Instead of this simple algorithm, we
could also use faster transitive closure computation based
on fast matrix multiplication, but this algorithm is
sufficient for our purposes. Example of a reduced graph of
variants can be seen in Figure 4.</p>
      <p>Given a reduced graph of variants Gr(Vr, Er), we can
modify the ILP program by including variables xi,(u,v) only
An5a:5802 An4a1:5818
An5a:5802 An4a1:5818
Cn2a6:87265 Cn2a6:7480</p>
      <p>Cn7a:9828 Cn4a3:18434 Cn4a6:8176</p>
      <p>Cn2a6:87265 Cn2a6:7480</p>
      <p>Cn7a:9828 Cn4a3:18434 Cn4a6:8176
Gn4a8:48602 Gn4a8:78118
Gn4a8:48602 Gn4a8:78118
Tn2a7:97465 Tn2a9:77180</p>
      <p>Tn4a2:98628 Tn7a6:1834 Tn3a8:6876</p>
      <p>Tn2a7:97465 Tn2a9:77180</p>
      <p>Tn4a2:98628 Tn7a6:1834 Tn3a8:6876
for edges (u, v) ∈ Er, because the remaining edges will not
be used by the optimal set of paths. We will change the
conditions as follows:
We also add one more condition stating that each edge
from Er must be used on some path:</p>
      <p>k
∀(u, v) ∈ Er : ∑ xi,(u,v) ≥ 1</p>
      <p>i=1
6</p>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <p>(11)
(12)
(13)
(14)
We tested our approach on Illumina sequencing data from
yeast species Magnusiomyces magnusii (unpublished).
For out tests, we have selected FDH (formate
dehydrogenase) gene that exhibits variation in the number of copies
in related species (for example, two copies in S.
cerevisiae, three copies in C. albicans, and ten copies in Y.
lipolytica). The sequence assembly by assembly program
Velvet [Zerbino and Birney, 2008] contained several
fragments of FDH-like genes (likely originating from
different gene copies), but all of them were apparently
incomplete. We have manually assembled some of these
fragments and created a putative 1150 bp long template of
the FDH gene. By a rather complicated process using
blastx [Altschul et al., 1997], blat [Kent, 2002], and
Velvet [Zerbino and Birney, 2008], we have selected 74 694
read pairs similar to the FDH template. The average length
of a read is 100.8 bp, and the expected length of the gap
between the paired reads is 60.</p>
      <p>The individual reads were then aligned to the FDH
template by blat. Based on read alignments, we have
identified all variable positions and alternatives. To avoid
sequencing errors, we have discarded all alternatives that
accounted for less than 10% of reads at each variable
position. The ILP was solved by CPLEX optimization library,
the maximum number of paths was set to 30.
Algorithm 1 Algorithm for reducing the variant graph
1: Input: Variant graph G(V,E)
2: selectedEdges = {}
3: for i = 1 . . . n do
4: Mi = {vi}
5: for j = i − 1 . . . 1 ; (v j, vi) ∈ E do
6: if v j 6∈ Mi then
7: Mi = Mi ∪ M j
8: selectedEdges = selectedEdges ∪{(v j, vi)}
9: end if
10: end for
11: end for
12: return selectedEdges</p>
      <p>Figure 5 shows the result of our experiment. The
number of edges in the graph of variants can be significantly
reduced, thus speeding up the ILP. The algorithm has
discovered six variants of FDH gene in M. magnusii genome.
Our experiments shows that even though, in general, the
problem is NP-hard, it is possible to apply the ILP on the
real data.
7</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we have introduced a problem of recovering
all variants of a given gene from unassembled next
generation sequencing data. We have formulated the problem
as a graph problem, shown that the problem is NP-hard,
and proposed a practical solution with the integer linear
program. We have shown that our solution can indeed be
applied to real data, and we have detected six variants of
the FDH gene in M. magnusii genome.</p>
      <p>There are several problems open for future work. In
this work, we have used pair-end reads, i.e. pairs of short
reads that are separated by a short gap (in this work, we
have assumed that the gap between the two paired reads
is shorter than the read length). On the other hand,
matepair sequencing produces paired reads that are at a much
longer distance. How do we incorporate such reads into
our framework?</p>
      <p>Here, we have considered variants where only a
single base has changed at a particular position. However,
other variants, such as short insertions or deletions, or even
larger scale changes, do occur in real data and they need
to be accounted for.</p>
      <p>One information that we ignored is the number of reads
overlapping each position of the template sequence. This
number is correlated with the number of copies of a
particular segment of the sequence. Such correlations can
be used to help to decide between alternative solutions
(each variant should have the same coverage over its whole
length).</p>
      <p>Finally, sequences flanking the template sequence on
each side will also differ between individual copies of a
. vertices of V are in topological order
. iterate through incoming edges in descending order
gene and may help to pinpoint the correct number of
variants.</p>
      <p>Acknowledgements. We thank prof. Jozef Nosek for
access to the unpublished sequencing data and for suggesting
FDH gene as an interesting target for study. This research
was supported by VEGA grant 1/1085/12 and by
Comenius University young researcher grant UK/425/2013.</p>
      <sec id="sec-6-1">
        <title>Brejová, T.</title>
      </sec>
      <sec id="sec-6-2">
        <title>Vinarˇ</title>
        <p>of FDH gene in M. magnusii genome. (top) The full graph of variants; only
alternatives accounting for
more than 10% of aligned genes were included. (middle) Reduced graph of variants significantly
reduces the number
of
edges. (bottom) Six variants
discovered by the ILP; each variant is shown in a different color.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Altschul et al.,
          <year>1997</year>
          ] Altschul,
          <string-name>
            <given-names>S. F.</given-names>
            ,
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            ,
            <surname>Schaffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            ,
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            , and
            <surname>Lipman</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. J.</surname>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
          .
          <source>Nucleic Acids Res</source>
          ,
          <volume>25</volume>
          (
          <issue>17</issue>
          ):
          <fpage>3389</fpage>
          -
          <lpage>3392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Bankevich et al.,
          <year>2012</year>
          ] Bankevich,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Nurk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Antipov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Gurevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            ,
            <surname>Dvorkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Kulikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            ,
            <surname>Lesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            ,
            <surname>Nikolenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. I.</given-names>
            ,
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Prjibelski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            ,
            <surname>Pyshkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            ,
            <surname>Sirotkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            ,
            <surname>Vyahhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Tesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Alekseyev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            , and
            <surname>Pevzner</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. A.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
          .
          <source>J Comput Biol</source>
          ,
          <volume>19</volume>
          (
          <issue>5</issue>
          ):
          <fpage>455</fpage>
          -
          <lpage>457</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[Ferragina and Manzini</source>
          , 2001] Ferragina,
          <string-name>
            <given-names>P.</given-names>
            and
            <surname>Manzini</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>An experimental study of an opportunistic index</article-title>
          .
          <source>In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete algorithms (SODA)</source>
          , pages
          <fpage>269</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Kent</source>
          , 2002] Kent,
          <string-name>
            <surname>W. J.</surname>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>BLAT-the BLAST-like alignment tool</article-title>
          .
          <source>Genome Res</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ):
          <fpage>656</fpage>
          -
          <lpage>664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Langmead et al.,
          <year>2009</year>
          ] Langmead,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Trapnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Pop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            , and
            <surname>Salzberg</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. L.</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
          .
          <source>Genome Biol</source>
          ,
          <volume>10</volume>
          (
          <issue>3</issue>
          ):
          <fpage>R25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>[Li</surname>
          </string-name>
          et al., 2008a]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Durbin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2008a</year>
          ).
          <article-title>Mapping short DNA sequencing reads and calling variants using mapping quality scores</article-title>
          .
          <source>Genome Res</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1851</fpage>
          -
          <lpage>1858</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>[Li</surname>
          </string-name>
          et al., 2008b]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kristiansen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2008b</year>
          ).
          <article-title>SOAP: short oligonucleotide alignment program</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>713</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>[Lin</surname>
          </string-name>
          et al.,
          <year>2008</year>
          ] Lin,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>M. Q.</given-names>
            ,
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , and
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>ZOOM! Zillions of oligos mapped</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>21</issue>
          ):
          <fpage>2431</fpage>
          -
          <lpage>2437</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Medvedev et al.,
          <year>2011</year>
          ] Medvedev,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Chaisson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            , and
            <surname>Pevzner</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Paired de bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers</article-title>
          .
          <source>J Comput Biol</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1625</fpage>
          -
          <lpage>1634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Minoche et al.,
          <year>2011</year>
          ] Minoche,
          <string-name>
            <given-names>A. E.</given-names>
            ,
            <surname>Dohm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            , and
            <surname>Himmelbauer</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems</article-title>
          .
          <source>Genome Biol</source>
          ,
          <volume>12</volume>
          (
          <issue>11</issue>
          ):
          <fpage>R112</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Nadalin et al.,
          <year>2012</year>
          ] Nadalin,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Vezzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            , and
            <surname>Policriti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>GapFiller: a de novo assembly approach to fill the gap within paired reads</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>13</volume>
          <issue>Suppl 14</issue>
          :
          <fpage>S8</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Peterlongo and Chikhi</source>
          , 2012] Peterlongo,
          <string-name>
            <given-names>P.</given-names>
            and
            <surname>Chikhi</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>13</volume>
          :
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Pevzner et al.,
          <year>2001</year>
          ] Pevzner,
          <string-name>
            <given-names>P. A.</given-names>
            ,
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            , and
            <surname>Waterman</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. S.</surname>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>An Eulerian path approach to DNA fragment assembly</article-title>
          .
          <source>Proc Natl Acad Sci U S A</source>
          ,
          <volume>98</volume>
          (
          <issue>17</issue>
          ):
          <fpage>9748</fpage>
          -
          <lpage>9753</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Pham et al.,
          <year>2013</year>
          ] Pham,
          <string-name>
            <given-names>S. K.</given-names>
            ,
            <surname>Antipov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Sirotkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Tesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Pevzner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            , and
            <surname>Alekseyev</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. A.</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Pathset graphs: a novel approach for comprehensive utilization of paired reads in genome assembly</article-title>
          .
          <source>J Comput Biol</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>359</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Philippe et al.,
          <year>2011</year>
          ] Philippe,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Salson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Lecroq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Leonard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Commes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            , and
            <surname>Rivals</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Querying large read collections in main memory: a versatile data structure</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          :
          <fpage>242</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>[Warren and Holt</source>
          , 2011] Warren,
          <string-name>
            <given-names>R. L.</given-names>
            and
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. A.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Targeted assembly of short sequence reads</article-title>
          .
          <source>PLoS One</source>
          ,
          <volume>6</volume>
          (
          <issue>5</issue>
          ):
          <fpage>e19816</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Zerbino and Birney</source>
          , 2008] Zerbino,
          <string-name>
            <given-names>D. R.</given-names>
            and
            <surname>Birney</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
          .
          <source>Genome Res</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>821</fpage>
          -
          <lpage>829</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Zhai et al.,
          <year>2012</year>
          ] Zhai,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Reinert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Waterman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            ,
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            , and
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Normal and compound poisson approximations for pattern occurrences in NGS reads</article-title>
          .
          <source>J Comput Biol</source>
          ,
          <volume>19</volume>
          (
          <issue>6</issue>
          ):
          <fpage>839</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>