<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DNA Sequence Segmentation Based on Local Similarity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martina Višnˇ ovská</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Vinarˇ</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronˇ a Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics, and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynská Dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1003</volume>
      <fpage>36</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>DNA sequences evolve by local changes affecting one or several adjacent symbols, as well as by largescale rearrangements and duplications. This results in mosaic sequences with various degrees of similarity between regions within a single genome or in genomes of related organisms. Our goal is to segment DNA to regions and to assign such regions to classes so that regions within a single class are similar and there is low or no similarity between regions of different classes. We provide a formal definition of the segmentation problem, prove its NPhardness, and give a practical heuristic algorithm. We have implemented the algorithm and evaluated it on simulated data. Segments found by our algorithm can be used as markers in a wide range of evolutionary studies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In this paper, we study the problem of sequence
segmentation arising in computational biology. We are given
a string over a finite alphabet, and our task is to identify
non-overlapping segments in this string and to partition
these segments into classes so that segments within each
class are similar to each other, and there are no
significant similarities between segments from different classes.
We will call the resulting segments atomic segments or
simply atoms. This task is somewhat similar to clustering,
where we also aim to split input data into classes (clusters)
so that the similarity between items in the same cluster
is higher than the similarity between items from different
clusters. The main difference is that in clustering, we are
already given a fixed set of items to partition into clusters,
whereas in our problem, the atomic segments can be
located anywhere along the input string, and thus we need to
simultaneously look for atoms themselves, as well as their
clustering.</p>
      <p>This problem arises in comparative genomics, where
we compare genomes (DNA sequences) of related species.
These sequences have evolved from a DNA sequence of
their common ancestor by a series of mutations, which
include local changes affecting one or several adjacent
symbols, as well as by large-scale events, such as
rearrangements and duplications. This results in mosaic sequences
with various degree of similarity between regions within
a single genome or in genomes of related organisms.</p>
      <p>
        Small-scale and large-scale evolutionary changes are
typically studied separately. Substitutions and short
insertions and deletions can be described by various
wellstudied probabilistic models of evolution
        <xref ref-type="bibr" rid="ref5">(Felsenstein,
2004)</xref>
        . Application of these models typically starts with
a set of homologous sequences, that is, sequences that have
evolved from a common ancestor by local changes. These
homologous sequences can represent atoms from one class
in our segmentation. Studies of large-scale events are more
often based on the parsimony principle, where the goal is
to find an evolutionary history leading to present-day
sequences, which contains the smallest possible number of
large-scale evolutionary events, such as rearrangements,
duplications, or deletions
        <xref ref-type="bibr" rid="ref6">(Fertin et al., 2009)</xref>
        . These
studies typically represent the genome as a sequence of
markers, and again, each atomic segment in our segmentation
can be used as a marker in this context.
      </p>
      <p>
        One frequently used method for obtaining segments or
markers for both types of evolutionary analyses is to use
genes, that is, portions of the genome encoding functional
protein or RNA molecules. However, these approaches
ignore information present in the intergenic regions of
the genomes, and are sensitive to errors in gene
annotation. In our work, we define atomic segments by
considering only information about local sequence similarities
between parts of the input sequences. Such similarities
can be readily found by one of many sequence alignment
programs; in our work we use lastz
        <xref ref-type="bibr" rid="ref9">(Harris, 2007)</xref>
        . In our
previous work
        <xref ref-type="bibr" rid="ref2">(Brejova et al., 2011)</xref>
        , we have informally
defined the goals of the segmentation problem and
provided a practical heuristic algorithm attempting to cover
the whole sequence by atoms. However, we were not able
to design a suitable formalization of the problem. Here, we
provide a formal definition of the problem as an
optimization problem, prove its NP-hardness, and provide a new
heuristic algorithm. The main difference from our
previous goal is that we do not require that the whole sequence
is covered by atoms; some problematic parts may be left
out. We have implemented the algorithm and evaluated it
on simulated data.
      </p>
      <p>
        Our algorithm can provide fine-scale segmentations,
with atom lengths in hundreds or thousands, in contrast
to some earlier approaches concentrating on much longer
segments. Multiple authors have for example studied
methods for constructing so called synteny blocks from
conserved chains of genes in two or multiple genomes
        <xref ref-type="bibr" rid="ref4 ref8">(Choi et al., 2007; Hachiya et al., 2009)</xref>
        ; similar approach
was used already by
        <xref ref-type="bibr" rid="ref10">Nadeau and Taylor (1984)</xref>
        . Our work
is also related to the area of multiple sequence alignment,
where the goal is to arrange k sequences into a matrix with
k rows by padding them with special gap symbols as
necessary so that each column contains homologous symbols.
In the presence of rearrangements and duplications, such
linear representation is not possible, and thus multiple
alignment programs split the genome into smaller blocks
and align symbols in each block separately
        <xref ref-type="bibr" rid="ref1 ref11 ref3">(Blanchette
et al., 2004; Brudno et al., 2003; Ovcharenko et al., 2005)</xref>
        .
These blocks can be also considered atomic segments in
our setting, but most multiple alignment algorithms do not
consider duplications and concentrate on recovering
correct alignment within individual blocks rather than on
optimal division of sequence into blocks.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Problem Definition</title>
      <p>The main goal of this section is a formal definition of
the segmentation problem as a combinatorial optimization
problem. The input will be a set of local similarities
between pairs of regions in the string, which we wish to
segment. In the rest of the paper, we will discuss
segmentation of a single string; in case that we have several strings,
such as DNA sequences from related species, we will first
concatenate them to one string.</p>
      <sec id="sec-2-1">
        <title>True segmentation. We start by defining the true seg</title>
        <p>mentation for a string with a known evolutionary history.
In this history, we ignore all substitutions, that is
operations changing one symbol into another, and consider
only events that insert, delete, move or duplicate parts of
the string. Each such operation can be viewed as
cutting the original string at some places known as
breakpoints, adding, discarding, copying or changing the
order of the pieces, and finally joining the pieces into one
string. The true segmentation will use these breakpoints as
atom boundaries, and will group to one class those atoms
that have evolved by duplication from a common ancestor.
In our simplified abstract view, DNA of multiple species
also arose by duplication of the whole sequence from one
species to another. However, if a breakpoint is created in a
part of the string which has multiple copies elsewhere, we
will also propagate this breakpoint to all these copies (see
Figure 1). This ensures that if we take two atoms, they
are either homologous (evolutionarily related) across their
whole length, or not homologous at all.</p>
        <p>For real biological sequences, the evolutionary history
and the true segmentation are not known. Indeed, our
motivation for segmenting the sequence is to use the result
as an input for evolutionary history inference. We want
to formulate an optimization problem, which will allow
us to recover the true segmentation or its approximation
based on information about sequence similarity. The
resulting algorithms can be tested on sequences generated
by a probabilistic model of evolution, where we know the
true segmentation. In practice, we do not want to create
an atom boundary at the position of each short insertion or
deletion, because these events are quite frequent. We will
thus only consider breakpoints created by longer events.
As we will see, our problem uses a parameter L for
minX :
Y :
Z:</p>
      </sec>
      <sec id="sec-2-2">
        <title>ABCDEF G</title>
      </sec>
      <sec id="sec-2-3">
        <title>ABCDEB0C0D0F G</title>
      </sec>
      <sec id="sec-2-4">
        <title>ABCDEB0D0FC0G</title>
        <p>imum atom length related to the granularity at which we
consider an event to be large-scale.</p>
        <p>Input data and notation. We now formally describe
input data for our computational problem and introduce
useful notation. As an input data for segmentation, we have
DNA sequence S = s1s2 . . . sn and a set α containing
alignments between pairs of similar regions within S. Ideally,
these alignments would reflect true homology, but in
reality, some alignments might be spurious, missing, or locally
incorrect. We will define an alignment more formally
below.</p>
        <p>We will frequently refer to positions and regions
within S. To do so, we number the spaces between consecutive
symbols of S so that the space between si and si+1 is
labeled i, space to the left of s1 is labeled zero, and space to
the right of sn is labeled n. By S(i, j) we denote the
contiguous region of S starting in the space labeled by i and
ending in the space labeled by j, where 0 ≤ i ≤ j ≤ n. In
this notation, we can refer to an empty region at specific
position i in S by S(i, i) (see Figure 2). We will consider
regions S(i, j) as open intervals. Region S(i, j) then
overlaps S(k, `), if the intersection of intervals (i, j) and (k, `)
is non-empty. We will say that S(i, j) covers S(k, `) if
interval (k, `) is a subset of interval (i, j).</p>
        <p>Recall that an alignment is a way of arranging two (or
more) sequences by inserting a special gap symbol at some
locations, so that both sequences have the same length
(Figure 3). It is typically used to represent significant
sequence similarities: the two aligned sequences are
assumed to be homologous and symbols in the same column
sequence T1:
sequence T2:</p>
        <p>A T C - - G C G G A</p>
        <p>A T C A A G T G - A
of the alignment are also hypothesized to share
evolutionary origin. To define the segmentation problem, we
represent an alignment as a relation between positions of two
sequences.</p>
        <sec id="sec-2-4-1">
          <title>Definition 1. An alignment a with source S(i, j) and des</title>
          <p>tination S(k, `) is a partial mapping from {i + 1, . . . , j} to
{k + 1, . . . , `}, such that a(i + 1) = k + 1, a( j) = `, and
if for some f &lt; g both a( f ) and a(g) are defined, then
a( f ) &lt; a(g).</p>
          <p>Informally, mapping t = a( f ) corresponds to the
situation when symbols s f and st are aligned in one column.
Note that we require that the first and the last symbols of
the two aligned regions map to each other (the alignment
does not have a gap symbol in the first or the last column).</p>
          <p>We will extend the notion of mapping from individual
symbols to whole regions. Let a be an alignment between
regions T1 and T2, and let U be a region such that T1 covers
U . Then, the mapping of region U through alignment a
is the shortest region within T2 that contains mapping of
every symbol from U , for which such mapping exists. We
will denote this regions as a(U ).</p>
          <p>Note that our definition of an alignment is asymmetric.
Typically for every alignment a mapping T1 to T2, the input
will also contain its reverse counterpart a0 mapping T2 to
T1 so that a(x) = y if and only if a0(y) = x.</p>
          <p>Problem definition. We are now ready to formally define
our computational problem.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>Definition 2. A segmentation of sequence S with respect</title>
          <p>to a set of alignments α and a length parameter L is a set
of regions A (termed atoms) for which the following
conditions hold:</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>A1) No two atoms from set A overlap.</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>A2) The length of each atom is at least L.</title>
        <p>A3) If a source or a destination of an alignment a ∈ α
overlaps some atom A, it also covers A.</p>
        <p>A4) If the source of alignment a ∈ α covers some atom A,
then the region a(A) overlaps with exactly one atom
from A .</p>
        <p>Condition A3 ensures that no atom contains
boundaries of any alignment. This is desirable, because
alignment boundaries intuitively correspond to breakpoints in
the evolutionary history.</p>
        <p>If sequence S contains several homologous copies of
a region d, we wish to avoid the situation where one copy
of d forms an atom by itself and another copy is only
a small part of a longer atom. This is enforced by
condition A4, which implies that atom A can be aligned to
a smaller part within another atom B, or to some region
which contains not only atom B, but also some
surrounding symbols. The second case is possible if the symbols
surrounding B are not covered by any other atom of A .
We call such areas waste regions.</p>
        <sec id="sec-2-5-1">
          <title>Definition 3. Let A1, . . . , Ap be the atoms of a segmenta</title>
          <p>tion A ordered by their position in S, where Ai = S( fi, ti).</p>
        </sec>
        <sec id="sec-2-5-2">
          <title>The waste region between atoms Ai and Ai+1 is the re</title>
          <p>gion S(ti, fi+1). Two additional waste regions S(0, f1) and</p>
        </sec>
        <sec id="sec-2-5-3">
          <title>S(tp, n) are located at sequence ends.</title>
          <p>Depending on a situation, we can specify a
segmentation either by its atoms, or by the waste regions between
the atoms. Alignments, which cover the same region of S
usually do not have boundaries at identical positions,
because it is difficult to detect true homology boundaries in
local alignment search. As the true boundary has an
uncertain position, we consider the region containing several
alignment boundaries close to each other to be a waste
region.</p>
          <p>To partition atoms to classes, we will represent
alignments in the form of a graph. In the alignment graph of
segmentation A , each vertex corresponds to an atom of A
and two vertices A and B are connected by an edge if some
alignment maps A to a region overlapping B. Then each
atomic class of A corresponds to one connected
component of the alignment graph.</p>
          <p>To choose the best segmentation, we introduce a cost
function. As we prefer to have a high coverage of input
sequence S by atoms, the cost function penalizes a
segmentation for every symbol located in a waste region. We
also prefer a segmentation with a lower number of atoms
to avoid unnecessary fragmentation. Therefore we
penalize a segmentation for the number of atoms, but this
second penalty has a low weight ε = 1/n, where n is the
length of S. For simplicity, our cost function does not
take atom distribution to classes into account. Thus a
segmentation A with p atoms of total length r has cost
c(A ) = (n − r) + ε p.</p>
          <p>We can now approach the sequence segmentation
problem as an optimization problem of searching for the
minimum cost segmentation for a given sequence S, set of
alignments α , and length parameter L.
3</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>NP-Hardness of the Segmentation</title>
    </sec>
    <sec id="sec-4">
      <title>Problem</title>
      <p>In this section, we will prove NP-hardness of the
segmentation problem. We will consider a decision version, in
which we are given a threshold B, and we ask if there is a
segmentation with cost at most B. It is easy to see that this
problem is in NP, because we can non-deterministically
guess a set of atoms and calculate its cost, as well as
verify the requirements of Definition 2 in polynomial time.</p>
      <p>
        We will prove NP-hardness of this problem by a
reduction from the classical one-in-three 3SAT problem
        <xref ref-type="bibr" rid="ref7">(Garey
and Johnson, 1979)</xref>
        . The instance of this problem
consists of a set U of variables, and a collection C of clauses,
each clause containing three literals. The problem is to
determine, if there is a truth assignment for U such that
each clause from C contains exactly one true literal. Our
goal is to transform such an instance (U,C) to an instance
(S, α , L, B) of the segmentation problem.
      </p>
      <p>In our construction, we will for simplicity create many
short sequences and alignments between these sequences
or their parts. To get one input sequence S, we can
concatenate these short sequences and adapt the created
alignments so that they refer to the corresponding parts of S.
Length threshold L can be an arbitrary constant greater
than two.</p>
      <p>The basic building block of our proof is a gadget
consisting of k + 3 sequences S0, . . . , Sk+2, all of the same
length m. Sequence S0 will be called the interface of the
gadget, because its parts align to sequences outside this
set, but sequences S1, . . . Sk+2 align only with other
sequences within the gadget. In particular, for every i and
j such that 0 ≤ i &lt; j ≤ k + 2, we create an alignment ai, j.
This alignment maps the first and the last symbol of Si to
the first and the last symbol of S j, respectively. It also
maps the r-th symbol of sequence Si to the (r + 1)-st
symbol of sequence S j, for 2 ≤ r ≤ m − 2. The second symbol
of Si and the (m − 1)-st symbol of S j remain unaligned in
this alignment. For every alignment ai, j, we also create its
reverse a j,i mapping S j to Si.</p>
      <p>The goal of this gadget is to ensure that the interface
sequence S0 will contain at most one atom; otherwise the
cost of the segmentation will be higher than B. This
follows from the following lemma if we set k = dBe + 1 (the
value of B will be determined later).</p>
      <sec id="sec-4-1">
        <title>Lemma 1. If sequence S0 contains at least two different</title>
        <p>atoms, then the gadget S0, . . . , Sk+2 contains at least k
symbols in waste regions.</p>
        <p>Proof. Let us assume that S0 contains at least two different
atoms, but the overall number of symbols in waste regions
is at most k − 1. This implies that at least three sequences
Si, S j and S` for 1 ≤ i &lt; j &lt; ` ≤ k + 2 do not contain any
wasted symbols. However, each of these sequences aligns
to S0, and since S0 contains at least two atoms, each of
these sequences needs to be split into at least two atoms
due to condition A4 of Definition 2. These atoms need
to be separated by a waste region of length 0. Let us
assume that Si(x, x) is such a waste region of length 0 in Si.
Symbols at positions x + 1 and x + 2 in S j align to
symbols at positions x and x + 1 in Si through a j,i, and thus
they cannot occur in the same atom in S j. This means that
S j(x + 1, x + 1) is a waste region as well. By the same
reasoning, S`(x + 1, x + 1) is a waste region due to
alignment a`,i and S`(x + 2, x + 2) is a waste region due to
alignment a`, j mapping symbols x + 2 and x + 3 to two
different atoms in S j. However, this is a contradiction, because
waste regions S`(x + 1, x + 1) and S`(x + 2, x + 2) create an
atom S`(x + 1, x + 2) of length one, and each atom needs
to have a length at least L &gt; 2. As S`(x + 1, x + 2) cannot
be an atom, it has to be a waste region, which is a
contradiction. Therefore the set S1, . . . , Sk+2 contains at least
k sequences containing at least one wasted symbol each,
and the total number of wasted symbols in the gadget is at
least k.</p>
        <p>Our construction will contain for each variable y ∈ U
two sequences Xy and X¬y, each of length L. It will also
contain a gadget Uy as described above, with each
sequence of the gadget having length 2L. Finally, for each
clause ci ∈ C we create a gadget Ci containing sequences
of length 3L. In addition to the alignments within gadgets,
we will add gapless alignments between sequences Xy and
parts of interfaces of gadgets Uy and Ci. We say that an
alignment is gapless, if it aligns regions T1 and T2 of the
same length and maps the r-th symbol of T1 to the r-th
symbol of T2. In particular, we will align sequence Xy by a
gapless alignment to the first half of the interface sequence
of Uy and we will align sequence X¬y to the second half of
this interface. For a clause ci = `1 ∨ `2 ∨ `3 we align X`1 to
the first third of the interface forCi and similarly align X`2
to the second third and X`3 to the last third of the interface.
All these gapless alignments are created in both directions.
Finally, we set B = 2L(|U | + |C|) + 1/2.</p>
        <p>An example of the construction is shown in Figure 4.
Clearly the construction is polynomial in terms of |U | +
|C|, and its correctness is summarized in the following
theorem.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Theorem 1. One-in-three 3SAT instance (U,C) is satisfi</title>
        <p>able if and only if the corresponding instance of the
segmentation problem has a solution with cost at most B.
Proof. Let us first assume that instance(U,C) is satisfied
by a truth assignment t : U → {T, F}. We will create a
segmentation A with cost c(A ) ≤ B.</p>
        <p>Consider sequences Xy and X¬y representing the
positive and negative literal of variable y ∈ U . One of these
sequences corresponds to the satisfied literal in the truth
assignment t; this sequence will be included as an atom
in A . The other sequence will be completely covered by
a waste region.</p>
        <p>Each sequence in each gadget Uy and Ci will contain
one atom. Each non-interface sequence will be completely
covered by its atom. Interface sequences will contain one
atom of length L in the region aligned to sequence X` for
the satisfied literal ` and waste region elsewhere. Note
that there is always exactly one satisfied literal ` for each
variable y and for each clause ci.</p>
        <p>It is easy to see that this segmentation satisfies all
conditions. The total number of wasted symbols in A is
2L(|U | + |C|) = B − 1/2. Since each atom has length at
least L &gt; 2, the total number of atoms is less than n/2,
where n is the total length of the concatenated sequence
for the segmentation problem. Therefore the contribution
U
a</p>
        <p>U
b</p>
        <p>U
c</p>
        <p>U
d
of the atom number to the total cost of A is less than
ε · n/2 = 1/2 and the total cost c(A ) is at most B.</p>
        <p>Now assume that we have a segmentation A with cost
at most B, and we will prove that then the instance (U,C)
of one-in-three 3SAT is satisfiable.</p>
        <p>Each interface sequence has alignment endpoint
every L symbols, and thus each atom in this sequence can
have length of at most L. However, by Lemma 1, each
gadget can have at most one atom in its interface. The
remaining regions of length L within interface will be
therefore waste. If some interface did not contain any atom,
then all sequences in the corresponding gadget will have
to be covered with waste as well, raising the cost above B.
This implies that there will be exactly one atom in each
interface.</p>
        <p>Let Ay be the atom within interface of Uy. This atom
is aligned to sequence X`, where ` is y or ¬y. We will
assign truth value t(y) so that literal ` is satisfied. Since
atom Ay aligns to X`, and X` has length L, it must be an
atom as well. Similarly, X¬` aligns to a waste region in the
interface of Uy, and thus X¬` is also covered with waste.
Therefore, t(`) is true for some literal ` if sequence X` is
an atom, and t(`) is false if X` is a waste region.</p>
        <p>Interface of Ci also contains exactly one atom Bi and
two waste regions of length L. Each of these three thirds
aligns to some X`. The alignment for the atom must map
to another atom, and an alignment for a waste region
must map to a waste region. Therefore exactly one of the
three literals in clause ci must have its corresponding
sequence X` covered by an atom, and this means that exactly
one literal of the clause is satisfied by t. We have thus
proved that the assignment t satisfies the corresponding
one-in-three 3SAT instance (U,C).
4</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Practical Algorithm for Sequence</title>
    </sec>
    <sec id="sec-6">
      <title>Segmentation</title>
      <p>Since the segmentation problem is NP-hard, we have
designed a heuristic algorithm for practical use. This
algorithm creates a segmentation satisfying Definition 2, but
possibly not the optimal one.</p>
      <p>
        Our algorithm gets as an input a set of evolutionarily
related sequences or a single sequence which contains
duplicated segments. We use the LASTZ program
        <xref ref-type="bibr" rid="ref9">(Harris,
2007)</xref>
        to align every sequence to every other and also to
itself. We use the same alignment preprocessing as in our
earlier work
        <xref ref-type="bibr" rid="ref2">(Brejova et al., 2011)</xref>
        to discard weak
alignments or their parts.
      </p>
      <p>Now we create initial waste regions of zero length at
both ends of each alignment, as our definition requires
that there are no alignment boundaries inside atoms. If
any consecutive waste regions are closer to each other
than L, they are joined to a single waste region, because
no atom can be located between them. The resulting set
of proto-atoms located between successive waste regions
satisfies conditions A1, A2, and A3 of Definition 2. It
is possible, though, that condition A4 is not satisfied, if
some proto-atom maps to a region overlapping two or
more other proto-atoms, or to a region completely covered
by waste (see Figure 5). Situations similar to the
example in Figure 5 can be caused by the choice of similarity
threshold, which we use to filter alignments. If alignments
between pairs of atoms in some class are close to the
similarity threshold, some of the alignments within the class
pass the threshold, whereas others do not.</p>
      <p>Our algorithm iteratively splits proto-atoms or extends
existing waste regions until we get a proper segmentation.
In a situation depicted in Figure 5, where a single
protoatom A1 + B1 maps to two proto-atoms A2 and B2, we
typically want to split the proto-atom A1 + B1 into two.
However, if atoms A2 and B2 are separated by a longer waste
region, there could be several possible places, where A1 + B1
could be split. One option would be to choose
arbitrarily one of those places. However, this choice might lead
to higher cost once we consider additional alignments.
Instead, we collect such splitting requirements from all
alignments covering the currently studied proto-atom and
choose the optimal set of new waste regions to satisfy all
of them. Instead of splitting the proto-atom, we may also
expand the waste region at one of its ends.</p>
      <p>We call our algorithm IMP (Inverse Mapping to a
Protoatom). In each step, we choose a proto-atom P, and
consider each alignment a whose source covers P. We will
map P through a and consider each waste region w within
region a(P). We will map w back to atom P by the
alignment a0 reverse to a, obtaining region w0. Let W be the
set of all such waste regions inversely mapped to P. We
will create a new set of waste regions Wnew, which has to
satisfy the following requirements:
C1) Each region in W overlaps with some waste region
from Wnew.</p>
      <p>C2) No region between two successive waste regions
from Wnew is completely covered by a region from W .
C3) Every region between two successive waste regions
from Wnew has length at least L.</p>
      <p>C4) The new set Wnew has the lowest cost.</p>
      <p>We apply this process to proto-atoms until no further
changes are made, and thus the resulting segmentation
satisfies conditions of Definition 2.</p>
      <p>We construct the optimal set Wnew for a proto-atom
P = p1 p2 . . . pm by dynamic programming. For each
prefix P(0, i) = p1 . . . pi we compute the cost of the optimal
waste region set with space labeled i being part of the last
waste region; we denote this quantity as cost(i). In
computing this cost, we consider all regions from W
overlapping P(0, i).</p>
      <p>We will first describe details of dynamic programming
for a simpler case when the set W does not contain any
pair of regions such that one of them covers the other. The
algorithm computes cost(i) only for those positions i that
are covered by at least one region from W . Suppose that i
is covered by a region from W and let P( j, k) be the
rightmost region in W such that k &lt; i. Since there is a waste
region at the space labeled i, all regions from W
overlapping i have condition C1 satisfied. However, condition C1
needs to be satisfied also forP( j, k), and therefore we have
cost(i) = min c(`, i),</p>
      <p>j≤`≤k
where value c(`, i) is the cost of the optimal set of new
waste regions for prefixP(0, i) given that both space i and
space ` are part of waste regions. Value c(`, i) consists of
the cost for prefixP(0, `) and the cost added by enforcing
a waste region at position i.</p>
      <p>cost(`) + (i − l) if i − ` ≤ L or

c(`, i) = 
cost(`) + ε
∃w ∈ W : P(i, `) ⊆ w
otherwise.</p>
      <p>In the first case, the waste regions containingi and ` are
joined together. Otherwise, we would create a new
protoatom which would be too short or which would align to a
waste region. In the second case of the formula, we create
a new proto-atom P(`, i) and add ε to the cost.</p>
      <p>If we include auxiliary region P(m, m) in the set W , we
will obtain the final cost of the optimal solution as cost(m).
During computation, we store for each i also position ` for
which c(`, i) achieves the minimum. We use these stored
positions to trace back the whole optimal solution Wnew.</p>
      <p>If the set of regions W contains some regions that cover
other regions, we split the set into two parts. Set C will
consists of the regions from W , which cover any other
region of W , and set W 0 will contain the remaining
regions. Clearly, if each region from W 0 overlaps some
region from Wnew, then also each region from C overlaps
some region from Wnew. We will therefore run the
algorithm, considering only regions from W 0 for finding
positions i and `, but using the whole set W to determine if
P(i, `) is a subset of some w ∈ W in computation of c(i, `).</p>
      <p>The dynamic programming algorithm runs in time
O(m2) on a proto-atom of length m. We repeat this
algorithm for all proto-atoms until there are no further changes.
Clearly, the number of such iterations is polynomial, since
in each iteration we either detect that no further changes
were made for any atom, or we increase either the number
of atoms or the number of wasted bases at least by one. In
practice, the number of iterations is quite small, up to four
in the experiments on simulated data described in the next
section.</p>
      <p>Note that this algorithm does not give optimal results,
because it always tries to ensure condition A4 of
Definition 2 by modifying the current atom P, but it might be
better to instead modify destinations of alignments with
source in P.</p>
      <p>
        Our new IMP algorithm is somewhat similar to the
original IHM algorithm introduced in
        <xref ref-type="bibr" rid="ref2">Brejova et al. (2011)</xref>
        in
the sense that we repeatedly map positions through input
alignments and modify the segmentation, until no more
changes are necessary. Unlike IMP, which maps whole
waste regions, the IHM algorithm maps individual
alignment endpoints and then modifies their location so that
at most one endpoint is located within each window of
size W . While our new algorithm produces a
segmentation conforming to Definition 2, no such formal definition
exists for IHM.
5
      </p>
    </sec>
    <sec id="sec-7">
      <title>Experimental Evaluation</title>
      <p>
        In this section, we evaluate the proposed algorithm on
simulated sequences and compare the accuracy of its
results with our earlier algorithm IHM
        <xref ref-type="bibr" rid="ref2">(Brejova et al., 2011)</xref>
        .
Simulated data have the advantage that we know their
true segmentation, and thus we can easily evaluate
performance of the algorithms.
      </p>
      <p>
        To evaluate a segmentation, we first computereciprocal
best matches (BRM) and boundary fitting matches (BFM)
for its atoms. Suppose that atom A belongs to the true
and atom B to the predicted segmentation. Atoms A and B
are BRM to each other, if they overlap and no other atom
overlaps with any of them by a larger or equal amount than
A and B overlap each other. Atoms A and B are BFM to
each other, if their start positions differ by at most some
amount k and likewise their end positions. As threshold
k we choose some fraction of the minimal atom length L,
in the experiments we use dL/4e. Note that at this
setting, each atom has at most one BRM match and at most
one BFM match, and that a BFM match, if it exists, is
automatically also a BRM match. BRM-based accuracy
measures were used already in
        <xref ref-type="bibr" rid="ref2">Brejova et al. (2011)</xref>
        ; we
propose BFM-based approach to measure if atom
boundaries are approximately correct.
      </p>
      <p>We use BRM and BFM to evaluate sensitivity and
specificity of the predicted segmentation. In particular, letb be
the number of BRM pairs, c be the number of BFM pairs,
p be the number of predicted atoms, and t be the number
of true atoms. Then BRM specificity is b/p, BRM
sensitivity is b/t, BFM specificity isc/p, and BFM sensitivity
is c/t.</p>
      <p>The BRM specificity decreases for example when
several predicted atoms overlap one true atom. The true atom
creates a BRM pair with only one of the predicted atoms,
while all the predicted atoms are included in p. In this
situation, a BFM pair does not have to exist for the true atom,
which causes even more significant decrease in the BFM
specificity. Conversely, if a predicted atom covers several
true atoms, we observe a decrease in BRM and BFM
sensitivity.</p>
      <p>We also evaluate correctness of the grouping of the
predicted atoms into classes. Let A be a class of true atoms
and B be a class of predicted atoms. We consider B to be
the correct prediction of A, if each atom of B has its
BRM/BFM pair in A and each atom of A has its BRM/BFM
pair in B. Let d be the number of correctly predicted
classes according to BRM, e be the number of correctly
predicted classes according to BFM, p be the number of
predicted classes, and t be the number of true classes.
Then BRM class specificity is d/p, BRM class
sensitivity is d/t, BFM class specificity is e/p, and BFM class
sensitivity is e/t.</p>
      <p>
        We have evaluated performance of the algorithms on ten
simulated data sets taken from
        <xref ref-type="bibr" rid="ref2">Brejova et al. (2011)</xref>
        . The
sets were produced by simulation of sequence evolution,
allowing substitutions, short insertions and deletions, as
well as large-scale deletions and duplications.
      </p>
      <p>
        Unlike our algorithm, which introduces waste regions
not covered by any atoms, the IHM algorithm covers the
whole sequence by atoms of length at least W . In regions
with many alignment boundaries this leads to many short
atoms with lengths approximately W . To make the IHM
comparable with the IMP algorithm, we have used W =
dL/4e, and we have subsequently filtered out all classes
containing atoms shorter than the minimal required length
L. Similar filtering was also used in
        <xref ref-type="bibr" rid="ref2">Brejova et al. (2011)</xref>
        .
      </p>
      <p>Table 1 show the comparison of IMP and IHM on the
simulated data with four different minimal atom lengths
L ∈ {50, 100, 250, 500}. Column TRUE of the table shows
the results for the true segmentation from which the atoms
shorter than L have been discarded. This is an upper bound
on the accuracy of any segmentation algorithm which does
not overestimate lengths of atoms.</p>
      <p>Both algorithms cover almost whole sequence by atoms
and have very high BRM sensitivity and class sensitivity,
close to the upper bound given by TRUE. Sensitivity of the
IHM algorithm is slightly higher than sensitivity of IMP.
However, IMP is much more specific for smaller values
of L. For higher L the task becomes easier and both
algorithm achieve good specificity. Under the BFM measures,
the new algorithm is both more specific and sensitive with
the exception of L = 500, where IHM slightly outperforms
IMP. Overall, IMP performs better, particularly at short
atoms lengths, but there is still room for improvement in
specificity and correct prediction of atom boundaries.
measure
coverage
sp
M sn
BR class sp
class sn
sp
M sn
FB class sp
class sn</p>
      <p>L = 50</p>
      <p>IMP IHM TRUE
6</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>We have studied a sequence segmentation problem arising
in comparative analysis of related DNA sequences. We
have defined the problem formally, proved its NP
hardness and provided a practical heuristic algorithm, which
uses dynamic programming to select an optimal
combination of several local changes. We have implemented and
evaluated our algorithm on simulated data to show that our
problem definition and algorithmic approach lead to
reasonable results with respect to a simple model of
evolution generating our data. Our tool can be used to prepare
data for numerous algorithms operating on sets of markers
extracted from multiple DNA sequences.</p>
      <p>This area offers both practical and theoretical questions
for further study. From a theoretical point of view, it would
be interesting to study approximation or fixed-parameter
algorithms for our problem. From a practical point of
view, more detailed evaluation on both simulated and real
biological data could discover weak points of our
algorithm and to help to improve it further.</p>
      <p>Acknowledgments. This research was supported by VEGA
grant 1/1085/12.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Blanchette</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kent</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riemer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elnitski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smit</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roskin</surname>
            ,
            <given-names>K. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baertsch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenbloom</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clawson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>E. D.</given-names>
          </string-name>
          , et al. (
          <year>2004</year>
          ).
          <article-title>Aligning multiple genomic sequences with the threaded blockset aligner</article-title>
          .
          <source>Genome Research</source>
          ,
          <volume>14</volume>
          (
          <issue>4</issue>
          ):
          <fpage>708</fpage>
          -
          <lpage>715</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Brejova</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vinar</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Automated segmentation of DNA sequences with complex evolutionary histories</article-title>
          . In Algorithms in Bioinformatics, 11th International Workshop (WABI), volume
          <volume>6833</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brudno</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malade</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poliakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Couronne</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubchak</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Batzoglou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Glocal alignment: finding rearrangements during alignment</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>i54</fpage>
          -
          <lpage>i62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sankoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Algorithms for the extraction of synteny blocks from comparative maps</article-title>
          .
          <source>In Algorithms in Bioinformatics (WABI)</source>
          , volume
          <volume>4645</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>277</fpage>
          -
          <lpage>288</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Felsenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Inferring phylogenies</article-title>
          . Sinauer Associates.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Fertin</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labarre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vialette</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Combinatorics of genome rearrangements</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Garey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1979</year>
          ).
          <article-title>Computers and Intractability: A Guide to the Theory of NPCompleteness</article-title>
          . W. H. Freeman &amp; Co.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Hachiya</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osana</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popendorf</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sakakibara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Accurate identification of orthologous segments among multiple genomes</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>25</volume>
          (
          <issue>7</issue>
          ):
          <fpage>853</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Improved pairwise alignment of genomic DNA</article-title>
          .
          <source>PhD thesis</source>
          , Pennsylvania State University.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          <article-title>and</article-title>
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          (
          <year>1984</year>
          ).
          <article-title>Lengths of chromosomal segments conserved since divergence of man and mouse</article-title>
          .
          <source>Proceedings of the National Academy of Science USA</source>
          ,
          <volume>81</volume>
          (
          <issue>3</issue>
          ):
          <fpage>814</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Ovcharenko</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loots</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          , amd Minmei Hou,
          <string-name>
            <given-names>B. M. G.</given-names>
            , Ma, J.,
            <surname>Hardison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            ,
            <surname>Stubbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            , and
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Mulan: Multiple-sequence local alignment and visualization for studying function and evolution</article-title>
          .
          <source>Genome Research</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>184</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>