<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Improved Algorithm for Ancestral Gene Order Reconstruction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Albert Herencsár</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronˇ a Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics, and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynská Dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>1214</volume>
      <fpage>46</fpage>
      <lpage>53</lpage>
      <abstract>
        <p>Genome rearrangements are large-scale mutations that change the order and orientation of genes in genomes. In the small phylogeny problem, we are given gene orders in several current species and a phylogenetic tree representing their evolutionary history. Our goal is to reconstruct gene orders in the ancestral species, while minimizing the overall number of rearrangement operations that had to occur during the evolution. The small phylogeny problem is NP-hard for most genome rearrangement models. We have designed a heuristic method for solving this problem, building on ideas from an earlier tool PIVO by Kovácˇ et al 2011. Our tool, PIVO2, contains several improvements, including randomization to select among potentially many equally good candidates in a hill-climbing search and a more efficient calculation of distances between gene orders. Using PIVO2, we were able to discover better histories for two biological data sets previously discussed in research literature. The software can be found at http://compbio. fmph.uniba.sk/pivo/.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In evolution, genome rearrangements are rare genomic
events. They are large-scale mutations that change the
order and orientation of genes in genomes. The goal of our
work is to reconstruct likely gene orders in ancestral
extinct species given gene orders in present-day genomes.
Such history reconstruction can help biologists to study
evolutionary processes shaping genomes of living
organisms.</p>
      <p>In particular, we study the small phylogeny problem.
On input, we have a phylogenetic tree describing an
evolutionary history of a set of species. The leaves of the tree are
the extant (current) species, and the internal nodes are their
ancestors. We also know the order of genes in the genomes
of the extant species. The task is to reconstruct the gene
orders of the ancestors, while minimizing the overall
number of rearrangement operations that had to occur during
the evolution (Fig. 1).</p>
      <p>
        The exact definition of the problem depends on the
considered model of genome rearrangement. We use the
breakpoint model
        <xref ref-type="bibr" rid="ref15">(Tannier et al., 2009)</xref>
        and the
doublecut-and-join (DCJ) model
        <xref ref-type="bibr" rid="ref2">(Bergeron et al., 2006)</xref>
        , which
we describe in detail in the next section.
      </p>
      <p>The small phylogeny problem is NP-hard for most
genome rearrangement models, and it is in practice
addressed by heuristic algorithms (Sankoff et al., 1976;
S7
S5</p>
      <p>S6
S1</p>
      <p>S2</p>
      <p>S3</p>
      <p>S4</p>
      <p>S1:
S2:
S3:
S4:
1
1
1
1
2
2
3
2
3
3
2
3</p>
      <p>
        Moret et al., 2001; Adam and Sankoff, 2008; Kovácˇ et al.,
2011). In this paper, we describe a series of improvements
to a recent software called PIVO by
        <xref ref-type="bibr" rid="ref11">Kovácˇ et al. (2011)</xref>
        .
Our new implementation, PIVO2, runs faster and is able
to find histories closer to the optimum, as we demonstrate
in a series of experiments on both real and simulated data.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>Genes, chromosomes and genomes. To study genome
rearrangements, genomes are typically represented as
sequences of markers called genes. A gene is thus in this
context an identifier of a certain region in the genome.
Genes have an orientation, going from left to right or from
right to left. A sequence of genes is called a chromosome.
We will consider both linear chromosomes, which have
two ends called telomeres, and circular chromosomes,
which do not have a distinct start or end. A genome is
a set of chromosomes. In our work, we will consider a set
of genomes over a set of genes {1, 2, . . . , g} such that each
genome contains exactly one copy of each gene.</p>
      <p>To represent genomes more formally, we will call the
two ends (extremities) of a gene its head and tail and
denote them for gene a as a+ and a− respectively. A pair
of extremities located next to each other in a genome
is called an adjacency. Two consecutive genes do not
need to have the same orientation, and thus an adjacency
between genes a and b can be of four different types:
{a+, b−}, {a+, b+}, {a−, b−}, {a−, b+}. If an
extremity is not adjacent to any other gene, it is called a
telomere. We represent it by a singleton set {a−} or {a+}.
A genome is then a set of adjacencies and telomeres in
which the tail and head of every gene appear in exactly
one adjacency or telomere.</p>
      <p>Models of genome rearrangement. Genome
rearrangements are large-scale mutations that move one or several
genes to a different position or change the number or
character of the chromosomes (Figure 2). The most common
rearrangement is a reversal (or inversion), which happens
when a chromosome breaks at two points and the middle
part is joined back in the opposite direction.</p>
      <p>
        In literature, several genome models were developed
        <xref ref-type="bibr" rid="ref6">(Fertin et al., 2009)</xref>
        , which allow us to measure distance
between two genomes by counting the minimum number
of rearrangements needed to transform one genome to
another.
      </p>
      <p>
        In this work, we concentrate on the double cut and join
model introduced by
        <xref ref-type="bibr" rid="ref18">Yancopoulos et al. (2005)</xref>
        and revised
by
        <xref ref-type="bibr" rid="ref2">Bergeron et al. (2006)</xref>
        . This model allows a single
general rearrangement operation called double cut and join
(DCJ). Informally, we break chromosomes at at most two
positions and rejoin them in a different way. The DCJ
operation is explained more formally in the following
definition.
      </p>
      <p>Definition 1. The double cut and join operation acts on
two adjacencies or telomeres A1 and A2 in one of the
following three ways:
• If A1 = {p, q} and A2 = {r, s}, they are replaced by
adjacencies {p, r} and {q, s} or by adjacencies {p, s}
and {q, r}.
• If A1 = {p, q} and A2 = {r} (A2 is a telomere), they
are replaced by {p, r} and {q} or by {q, r} and {p}.
• If A1 = {p} and A2 = {q} (both are telomeres), they
are replaced by adjacency {p, q}.</p>
      <p>In addition, as the inverse of the third case, an adjacency
{p, q} can be replaced by two telomeres {p} and {q}.</p>
      <p>Using DCJ operations, we can simulate all the
common genome rearrangement operations, including
reversal, chromosome fusion and fission, as well as
circularisation and linearisation of chromosomes.</p>
      <p>The distance between two genomes π and σ is in the
DCJ model defined as the minimal number of DCJ
operations that transform π into σ . It can be efficiently
computed using the adjacency graph AG(π, σ ) (Figure 3).
This graph is a bipartite multigraph, with one partition for
1-5+
1+2+
each genome. The vertices correspond to the adjacencies
and telomeres of the two genomes. Every vertex v ∈ π
and w ∈ σ is connected by |v ∩ w| edges. So, if v and w
represent the same adjacency in both genomes, they are
connected by two edges. If v and w have one common
extremity, they are connected by a single edge, and if they
have no overlap, they are not connected by any edge.</p>
      <p>
        In the adjacency graph, every telomere has degree one
and every adjacency has degree two. If the two genomes
are equal, the graph consists of cycles of length two and
paths of length one.
        <xref ref-type="bibr" rid="ref2">Bergeron et al. (2006)</xref>
        have proved
that the DCJ distance of genomes π and σ can be
computed as follows. Let g be the number of genes, c the
number of cycles in AG(π, σ ) and pO the number of paths
of odd length. Then the DCJ distance between π and σ
is distDCJ(π, σ ) = g − (c + pO/2). This quantity can be
easily computed in O(g) time.
      </p>
      <p>
        We will also consider a simple breakpoint model
        <xref ref-type="bibr" rid="ref13 ref3">(Sankoff and Blanchette, 1997)</xref>
        , which does not describe
particular rearrangement operations, only defines the
distance between two genomes. Informally, this distance
counts the number of breaks we need to introduce to the
chromosomes of one genome so that by joining them in a
different way we obtain the other genome. For
multichromosomal genomes, it was defined by
        <xref ref-type="bibr" rid="ref15">Tannier et al. (2009)</xref>
        as follows.
      </p>
      <p>Definition 2 (Breakpoint distance). The breakpoint
distance of genomes π and σ is:
distBP (π, σ ) = g − a (π, σ ) −
e (π, σ )
2
where g is the number of genes, a(π, σ ) is the number of
common adjacencies in genomes π and σ , and e(π, σ ) is
the number of common telomeres in genomes π and σ .</p>
      <p>For example, the genomes in Figure 4 have five genes
each, two common adjacencies and one common telomere.
Their breakpoint distance is then 5 − 2 − 1/2 = 2.5.
The Small Phylogeny Problem. In the small phylogeny
problem, we are given a phylogenetic tree and the
genomes of the extant species (see Figure 1). The task
is to compute the genomes of ancestors, while minimizing
the number of genome rearrangement operations required
during evolution. We now define individual terms more
formally.</p>
      <p>A phylogenetic tree is a binary tree T = (V, E) rooted at
node r, describing evolutionary relationships among a set
of species. The leaves of the tree T are the extant species,
and each internal node represents the most recent common
ancestor of its children.</p>
      <p>Let G be the set of all possible genomes on a particular
set of genes. An evolutionary history h is a function that
assigns a genome from G to every node of tree T = (V, E):
h : V → G. The score of history h is
score(h, T ) =</p>
      <p>dist(h(u), h(v)),
∑
(u,v)∈E
where dist is the distance measure of the chosen
rearrangement model.</p>
      <p>In the small phylogeny problem, we are given a
phylogenetic tree T = (V, E) with leaves L ⊂ V . We are also
given a function g : L → G, which assigns a genome to
every leaf node. The task is to compute the evolutionary
history h, which extends function g to cover all the nodes
of the tree, while having the lowest possible score.</p>
      <p>
        The median problem is a special case of the small
phylogeny problem for three species. We are given three
genomes and the goal is to compute the genome
minimizing the sum of distances to the three input genomes.
The median problem was shown to be NP-hard for most
rearrangement models, including multichromosomal DCJ
model
        <xref ref-type="bibr" rid="ref15">(Tannier et al., 2009)</xref>
        . One interesting exception is
the breakpoint distance, where the median can be
computed in polynomial time
        <xref ref-type="bibr" rid="ref15">(Tannier et al., 2009)</xref>
        . Being
more general, the small phylogeny is also NP-hard in most
models. The NP hardness of the small phylogeny problem
for the breakpoint distance was shown by
        <xref ref-type="bibr" rid="ref10">Kovácˇ (2014)</xref>
        .
Solving the Small Phylogeny Problem. Probably the most
popular method for solving the small phylogeny problem
is the Steinerization method
        <xref ref-type="bibr" rid="ref14">(Sankoff et al., 1976)</xref>
        . In this
method, the algorithm iteratively improves the
evolutionary history until a local optimum is found. In each
iteration, the algorithm chooses an internal node v and
calculates median πM of the genomes in its three neighbouring
nodes ϕa, ϕb, ϕc. We replace the genome inside node v
with genome πM, if the new tree has a lower score. When
none of the internal nodes can be improved, the algorithm
has found a local optimum.
      </p>
      <p>Although the median problem is NP-hard in most
rearrangement models, several solvers have been developed
which in practice work in acceptable running time. The
h(g)</p>
      <p>Cg : cg,1, cg,2, cg,3, cg,4
h(e)</p>
      <p>h( f )
Ce : ce,1, ce,2, ce,3</p>
      <p>Cf : c f ,1, c f ,2, c f ,3
h(a)
h(b)
h(c)
h(d)</p>
      <p>cg,1
ce,2</p>
      <p>c f ,3
h(a)
h(b)
h(c)
h(d)</p>
      <p>
        Steinerization method was used for example in
BPAnalysis software for the breakpoint model
        <xref ref-type="bibr" rid="ref13 ref3">(Blanchette et al.,
1997)</xref>
        and in GRAPPA software
        <xref ref-type="bibr" rid="ref12">(Moret et al., 2001)</xref>
        for
both the breakpoint and reversal models. The
Steinerization method for the DCJ model was implemented by
        <xref ref-type="bibr" rid="ref1">Adam
and Sankoff (2008)</xref>
        . MGR
        <xref ref-type="bibr" rid="ref4">(Bourque and Pevzner, 2002)</xref>
        is another small phylogeny solver for the reversal model.
It uses a simple heuristic based on operations which bring
genomes closer to other genomes in the tree.
      </p>
      <p>Our work is however based on a different algorithm
called PIVO, which encompasses and extends
Steinerization approaches (Kovácˇ et al., 2011). Next, we describe
this algorithm in more details.</p>
      <p>The PIVO Software. The PIVO (Phylogeny by IteratiVe
Optimization) (Kovácˇ et al., 2011) is a small phylogeny
solver, which similarly to previous approaches uses a
local search to iteratively improve initial history until a local
optimum is found. While Steinerization methods update
one internal node at a time, PIVO can potentially update
values in all internal nodes together. This sometimes
allows it to escape from situations where no internal node
can be improved on its own.</p>
      <p>
        In every iteration, PIVO generates a set of candidate
genomes Cv for every internal node v of the tree. Although
        <xref ref-type="bibr" rid="ref11">Kovácˇ et al. (2011)</xref>
        describe several strategies for
generating these candidates, the PIVO software typically uses
as candidates for node v all genomes within DCJ distance
one from the current genome h(v).
      </p>
      <p>The local search then computes a new history h0 by
selecting one genome from each candidate list Cv. The new
history is chosen so that it has the smallest score out of
all histories that can be produced by selecting values from
candidate lists (see Figure 5).</p>
      <p>The new history h0 is computed by dynamic
programming as follows. Let us denote the i-th candidate of node v
as cv,i. Let M[v, i] be the lowest possible score we can get
for the subtree rooted at v, if h0(v) = cv,i. For an internal
node v with set of children X , we can compute value M[v, i]
using the following recurrence:</p>
      <p>M[v, i] = u∑∈X mjin{M[u, j] + dist(cv,i, cu, j)}.</p>
      <p>In the first phase of the dynamic programming, the
values M[v, i] are computed from leaves up (we set M[v, 0] = 0
if v is a leaf). In the second phase, we choose candidates
from root down. In particular, in the root, we can select
any candidate with the lowest score. For each node u with
parent v, we select the candidate cu, j for which the sum
of M[u, j] + dist(h0(v), cu, j) is minimal. If n is the number
of species, g is the number of genes in every genome, and
k is the number of candidates in every internal node, then
the best candidates can be selected in O(ngk2) time (we
suppose that the distance can be calculated in O(g) time,
which is the case for both DCJ and breakpoint distances).
The PIVO algorithm is flexible, and it can work with
several genome rearrangement models and distances.</p>
      <p>In practice, the local search is run several times from
different starting histories to increase the chance that
a near-optimal or optimal history will be found.
3</p>
    </sec>
    <sec id="sec-3">
      <title>PIVO2 Algorithm</title>
      <p>
        We have implemented a new version of the PIVO
algorithm, which we call PIVO2. Our main contributions are
randomization of candidate selection and a more efficient
algorithm for distance calculations in dynamic
programming. Before describing these two topics in more detail,
we briefly outline other changes made to the algorithm.
• The original PIVO software was written in Python;
we have reimplemented it in Java to improve the
running time.
• In PIVO, the local search starts from a history in
which each internal node is initialized with a
completely random genome. However, such histories are
usually very far from the optimum. In PIVO2, we
instead use the genomes of the extant species. In
particular, every internal node v with children u and w is
initialized randomly with h(u) or h(w).
•
        <xref ref-type="bibr" rid="ref11">Kovácˇ et al. (2011)</xref>
        discuss several strategies for
generating candidate list Cv in every iteration of the
algorithm. In PIVO2, we adapt the strategy, where the
candidate list Cv contains all histories within DCJ
distance 1 from h(v). Optionally, we prune from this list
all histories that increase the sum of distances to
current histories in the three neighbours of v too much.
In PIVO2, we also add all genomes h(u) from the
old history to the candidate list of every node v. This
method allows genomes to "jump" from one node to
another.
Based on a proposal in
        <xref ref-type="bibr" rid="ref11">Kovácˇ et al. (2011)</xref>
        , we
optionally also add the genomes from the previous
solutions to the candidate lists. The motivation is that
we may get a better evolutionary history by
combining different solutions.
• To avoid making the same decision in repeated
searches, we have implemented a Tabu search
metaheuristic
        <xref ref-type="bibr" rid="ref7 ref8">(Glover, 1989a,b)</xref>
        . The Tabu search tries
to avoid getting always the same local optimum by
keeping a tabu list of already visited configurations
and penalizing configurations that are already on the
list.
      </p>
      <p>PIVO2 keeps a tabu list Tv for every internal node v.
This list contains genomes which were assigned to v
in one of the previously considered histories.</p>
      <p>
        When the dynamic programming calculates the score
of a candidate cv,i, it adds a penalty 1/(|V | + 1) to
the score, if the candidate cv,i is in the tabu list Tv.
Thus, if there are good candidates which were not
present in previous solutions, the candidate cv,i is
not selected. However, histories with a smaller
rearrangement score will always be preferred, even if
they incur the tabu penalty in every node.
• In the DCJ model, a genome can be a mix of
linear and circular chromosomes. However, in the real
world, the organisms usually have either one circular
or several linear chromosomes. PIVO was designed
to be flexible with respect to allowed chromosome
architectures, because it was used to study the
mitochondrial genomes of yeasts from a group
containing both linear and circular genomes
        <xref ref-type="bibr" rid="ref16">(Valach et al.,
2011)</xref>
        . The preferred type of genome architecture
(circular or linear) can be prioritised by penalizing
genomes which do not have the preferred structure.
In PIVO2 we have implemented more refined
penalties. In particular, if the minimum number of DCJ
operations to transform a given genome cv,i to a
preferred genome architecture is r, we add penalty 2r to
M[v, i] in the dynamic programming algorithm. With
this penalty setting, the genomes in ancestral nodes
mostly have the preferred architecture.
3.1
      </p>
      <sec id="sec-3-1">
        <title>Randomization of Candidate Selection in the</title>
      </sec>
      <sec id="sec-3-2">
        <title>PIVO2 Algorithm</title>
        <p>In the second phase of the dynamic programming
algorithm for candidate selection, the original PIVO software
always selects the first best candidate in each node.
However, often there are multiple solutions with the optimal
score. By choosing the first one, PIVO makes the same
decision in repeated searches. In the PIVO2 algorithm, we
have introduced randomized candidate selection, in which
we choose a random history out of all combinations of
candidates with the best score. The randomization ensures
that each of these evolutionary histories will be picked
with equal probability. To do so, we need to extend the
dynamic programming algorithm as outlined below.</p>
        <p>In the first phase of the dynamic programming, the
PIVO2 algorithm calculates the score M[v, i] of each
candidate cv,i, but it also counts how many solutions with this
score exist in the subtree rooted at v, if cv,i is selected. This
value is denoted Q[v, i]. If node v has a child u, we first find
the set B[u, i] of candidates in Cu for which optimal value
for cv,i is achieved. This set is computed as follows:
B[u, i] = {cu, j ∈ Cu | M[u, j] + dist(cu, j, cv,i) = M[v, i]}
If node v has children u and w, we can then use sets
B[u, i] and B[w, i] to compute Q[v, i]. First, the number of
solutions is summed in each subtree separately, and the
two sums are then multiplied.
If node v is a leaf, we set Q[v, 0] = 1.</p>
        <p>In the second phase of the dynamic programming
algorithm, we start at the root r and consider the set of
candidates with minimal score:</p>
        <p>B∗ = {cr,i | cr,i ∈ Cr, M[r, i] is minimal}
The algorithm then selects candidate cr,i ∈ B∗ with
probability Q[r, i]/Q[r], where Q[r] is the total number of
solutions with minimal score:
Choosing the candidate in other internal nodes is even
easier, because it is already known, which candidate was
selected in the parent node. If node v is a child of node p
and if cp,k was selected in p, then the algorithm selects a
candidate from B[v, k]. The probability of selecting
candidate cv,i ∈ B[v, k] is Q[v, i]/Q[v], where Q[v] is the number
of solutions with minimal score in the subtree rooted at v:
This randomized selection can be implemented without
increasing the overall time complexity of the dynamic
programming algorithm.
3.2</p>
      </sec>
      <sec id="sec-3-3">
        <title>Efficient Distance Computation in PIVO2</title>
        <p>The PIVO algorithm computes distances between many
pairs of candidates. When it calculates the scores M[v, i]
of candidates of node v (with children u and w), it
computes distances between every pair of candidates from Cv
and Cu, and from Cv and Cw, respectively.</p>
        <p>The PIVO2 algorithm uses the "Neighbour" and the
"Tree" strategies to generate candidate sets Cv. For every
π1
πn
π
σ
σ1
σm
node v, the "Neighbour" method generates Θ(g2)
candidates, where g is the number of genes. These candidates
are in DCJ distance 1 from the genome assigned to node v
in the previous iteration. The "Tree" method generates
Θ(|V |) candidates. Usually, the number of genes is larger
than the tree size, and so g2 |V |. Therefore, the
majority of distance calculations is made on pairs of genomes
πi and σ j, such that all πi have distance 1 from a genome
π, and all σ j have distance 1 from a genome σ (see
Figure 6). We have designed a faster method of calculating
the breakpoint and DCJ distances in such cases.</p>
        <p>We first describe the algorithm for efficient
breakpoint distance calculation. To represent adjacencies in
a genome, we use array G, which for every extremity
stores its adjacent extremity (see Figure 7). We call G the
genome array. Telomere extremity e is stored as G[e] = e.
The head of gene a is represented at index 2a − 1 and its
tail at index 2a − 2. Every adjacency is thus stored twice
and telomeres are stored once.</p>
        <p>Let Dπ,σ be the set of all indices where genome arrays
of genomes π and σ differ. We will call this set the
difference set between π and σ . The breakpoint distance of π
and σ can be calculated as |Dπ,σ |/2, and is thus related to
the Hamming distance between the genome arrays
representing the two genomes.</p>
        <p>We will first describe a subroutine
update(π, σ , σ 0, d, D), which gets genome arrays of
genomes π, σ and σ 0, the breakpoint distance d between
π and σ , and the difference set D between σ and σ 0. It
calculates the breakpoint distance between π and σ 0 in
O(|D|) time. We will initialize the distance with value d
and update it for every index i ∈ D as follows:
• If π[i] = σ 0[i], we subtract 1/2 from the distance.</p>
        <p>This is because σ [i] 6= π[i], and thus index i
contributes value 1/2 to the distance between π and σ ,
but it does not contribute anything to the distance
between π and σ 0.
• Otherwise if π[i] = σ [i], we add 1/2 to the distance.</p>
        <p>This is because index i does not contribute anything
to the distance between π and σ , but contributes 1/2
to the distance between π and σ 0.
• Otherwise the distance does not change, because
π[i] 6= σ [i] and π[i] 6= σ 0[i], and thus index i
contributes 1/2 to both distances.</p>
        <p>A DCJ operation cuts at most two adjacencies or
telomeres and creates at most two new adjacencies or telomeres</p>
        <sec id="sec-3-3-1">
          <title>Index Extremity Adjacent extremity</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>Index</title>
          <p>Extremity
Adjacent extremity
using the same extremities. Thus the difference set
between the original and the new genomes will have size at
most four. During "Neighbour" candidate generation, we
know for every candidate which DCJ operation was
performed, and we can save this additional information as a
difference set without increasing the time complexity.</p>
          <p>Let us now describe a method for calculating the
breakpoint distance between every pair of genomes πi ∈ A and
σ j ∈ B, where every πi ∈ A is a genome in DCJ distance
1 from a genome π, and every σ j ∈ B is a genome in DCJ
distance 1 from a genome σ (see Figure 6). We know for
every genome πi ∈ A the difference set Dπ,πi and for every
genome σ j ∈ B the difference set Dσ,σ j . Our algorithm
first calculates the distanced of π and σ in O(g) time. The
computation of distance between πi and σ j then proceeds
in two update steps:
• d0 = update(π, σ , σ j, d, Dσ,σ j )
• distBP(πi, σ j) = update(σ j, π, πi, d0, Dπ,πi )</p>
          <p>Our algorithm reduces the time complexity of
computing distances between all pairs of genomes πi ∈ A and
σ j ∈ B from O(g · |A| · |B|) to O(g + u · |A| · |B|), where
u is the size of the difference sets. Note that in our case
u ≤ 4, and thus it can be considered a constant.</p>
          <p>Similar, but a more complex method also works for DCJ
distance calculation between all pairs of genomes πi ∈ A
and σ j ∈ B. Recall that to compute the DCJ distance, we
represent the two genomes as an adjacency graph, in which
each connected component is a cycle or a path. We can
compute the DCJ distance if we know the number of cycles
and paths of odd length in this graph.</p>
          <p>For our algorithm, it is more convenient to divide each
vertex containing two extremities into two vertices
connected by an auxiliary edge. For example the vertex
15+ in the top partition of Figure 3 is divided into vertices
1- and 5+ connected by an auxiliary edge. Regular edges
connect the new vertex 1- to its counterpart 1- in the other
genome and similarly for 5+. Each vertex then represents
one extremity in one of the genomes and cycles and paths
are simply lists of extremities. Note that auxiliary edges
are not counted when distinguishing paths of odd and even
lengths.</p>
          <p>Instead of difference sets, we will now characterize the
differences between a pair of genomes by a set of
disconnect and connect operations. Each disconnect operation
removes one auxiliary edge corresponding to a removed
adjacency between two extremities, and each connect
operation creates a new auxiliary edge, corresponding to a
new adjacency. A DCJ operation corresponds to at most
two disconnect and two connect operations.</p>
          <p>
            We spend O(g) time to compare genomes π and σ . We
compute their adjacency graph, find its components and
compute the DCJ distance. We can then compute the DCJ
distance between πi ∈ A and σ j ∈ B in O(o3) time, where
o is the size of the connect and disconnect sets between
π and πi and σ and σ j respectively. To do so, we store
the positions of individual extremities in the components
of the original graph. We maintain each modified
component of the graph after updates as a list of intervals from
the original components (which are paths or cycles). By
appropriately linking these structure together, we can
process each update (connect or disconnect) in O(o2) time.
Details are omitted due to space constraints and can be
found in the thesis by
            <xref ref-type="bibr" rid="ref9">Herencsár (2014)</xref>
            . The overall
running time to compute distances between all pairs is thus
O(g + o3 · |A| · |B|).
4
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In this section, we present the results of an experimental
comparison of PIVO2 to the original PIVO on real and
simulated data sets.</p>
      <p>
        Real data. We have first run PIVO2 on two real
biological data sets previously considered in literature. The first
data set consists of 13 chloroplast genomes from plants
of the Campanulaceae family
        <xref ref-type="bibr" rid="ref5">(Cosner et al., 2000)</xref>
        , each
genome consisting of a single circular chromosome. The
reconstruction of ancestral genomes of these species
under the DCJ model was studied in several earlier works,
as outlined in Table 1. We consider two cases: when
ancestors are restricted to have a single circular
chromosome and when their genome architecture is unrestricted.
In the unrestricted case, PIVO2 was able to achieve better
results than previous approaches, including PIVO. In the
restricted case, we match the score obtained by PIVO.
      </p>
      <p>
        The second data set contains 16 mitochondrial genomes
of pathogenic yeasts from the CTG clade of
Hemiascomycetes
        <xref ref-type="bibr" rid="ref16">(Valach et al., 2011)</xref>
        . As we have already
discussed, these genomes have various architectures: some
consist of a single linear chromosome, one consists of two
linear chromosomes, and the rest consist of a single
circular chromosome.
      </p>
      <p>Due to this variability in the genome architecture of
extant species, the ancestral genomes could have a single
circular chromosome, or one or more linear chromosomes.
The original PIVO algorithm found an evolutionary
history with score 78 (Kovácˇ et al., 2011), and the PIVO2
algorithm found a better evolutionary history with score
77. If we allow ancestral genomes with an arbitrary
architecture (including a mixture of circular and linear
chromosomes in the same ancestor), an evolutionary history with
score 75 was found by the PIVO2 algorithm (no result was
published for the original PIVO algorithm).</p>
      <p>Results on simulated data. We have also tested our
algorithm on simulated data and compared its speed and
accuracy to the original PIVO software. To generate data, we
have used the phylogenetic tree of the Campanulaceae. A
random genome consisting of 25 genes was generated for
the root, and random evolution consisting of DCJ
operations was simulated along branches of the tree. No
restrictions were applied on the karyotype, i.e. an arbitrary mix
of linear and circular chromosomes was allowed. The
extant genomes, which were produced by simulating random
evolution, were used as the input for both algorithms.</p>
      <p>Overall, we have generated 5 such test cases, which
differed by the number of mutations allowed along each edge,
ranging from 3 to 8. These histories have a relatively high
number of events in a short genome, and as a result, it
is often possible to find a lower-scoring history than the
one actually generating the data. For each input, we have
run both PIVO and PIVO2 local searches many times and
recorded the local optimum from each run. For each input,
the overall best history found by PIVO2 was better or the
same as the one for PIVO. In addition, the distribution of
local optima was shifted towards lower values for PIVO2,
as illustrated for one test case in Figure 8.</p>
      <p>On each input, PIVO2 performed on average more
iterations of candidate selection than PIVO in each run. For
example on the input considered in Figure 8, the average
number of iterations for PIVO was 5.3 and for PIVO2 it
was 8.6. However, this increase is amply justified by the
fact, that the PIVO2 algorithm gives better results than the
original PIVO.</p>
      <p>Finally, we have compared the speed of individual
candidate selection runs, to asses the impact of our faster
distance calculation. The original PIVO algorithm was
written in Python, and the PIVO2 algorithm was
reimplemented in Java. Inherently, Python runs slower than
Java, and therefore, it is difficult to compare the speed of
the two implementations meaningfully. According to our
measurements, PIVO2 (with efficient distance calculation
mode switched off) is approximately 6-7 times faster than
the original PIVO. As we will show next, PIVO2 runs even
faster in the efficient distance calculation mode.</p>
      <p>We have again created several simulated test cases, but
in these tests we gradually increased the number of genes.
The PIVO2 algorithm was run with the efficient distance
calculation mode enabled or disabled, and we measured
the average time needed to compute the distances
between each pair of candidates of two neighbouring internal
nodes. The results are shown in Figure 9. As we would
expect, the speedup increases with the number of genes,
because we have lowered the asymptotic complexity.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work, we have introduced several improvements to
the PIVO algorithm for reconstruction of ancestral gene
orders. The resulting software, PIVO2, was able to find
histories with better score and has a potential to be useful
for evolutionary studies on real biological data sets.</p>
      <p>One possibility for further research is in the area of
efficient distance computation. Our algorithm could be
extended to sets of candidates which are likely to contain
clusters of very similar genomes, but these clusters are not
given explicitly in advance. The algorithm would have to
discover such clusters automatically, choose a
representative of each cluster, compute distances between
representatives and then adjust the distances for other members of
the clusters.</p>
      <p>Acknowledgments. This research was supported by VEGA
grants 1/1085/12 and 1/0719/14.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Adam</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sankoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>The ABCs of MGR with DCJ</article-title>
          .
          <source>Evolutionary Bioinformatics Online</source>
          ,
          <volume>4</volume>
          :
          <fpage>69</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bergeron</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mixtacki</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stoye</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>A unifying view of genome rearrangements</article-title>
          .
          <source>In Workhop on Algorithms in Bioinformatics (WABI)</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Blanchette</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourque</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sankoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Breakpoint phylogenies</article-title>
          .
          <source>In Workshop on Genome Informatics</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bourque</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pevzner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Genome-scale evolution: reconstructing gene orders in the ancestral species</article-title>
          .
          <source>Genome Research</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Cosner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moret</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raubeson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warnow</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wyman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>An empirical comparison of phylogenetic methods on chloroplast gene order data in Campanulaceae</article-title>
          . In Sankoff, D. and
          <string-name>
            <surname>Nadeau</surname>
          </string-name>
          , J. H., editors,
          <source>Comparative Genomics</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>121</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Fertin</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labarre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vialette</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Combinatorics of genome rearrangements</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Glover</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>1989a</year>
          ).
          <source>Tabu Search - Part I. ORSA Journal on Computing</source>
          ,
          <volume>1</volume>
          :
          <fpage>190</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Glover</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>1989b</year>
          ).
          <article-title>Tabu Search - Part II</article-title>
          .
          <source>ORSA journal on Computing, 2</source>
          <volume>1</volume>
          :
          <fpage>4</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Herencsár</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>An improved algorithm for ancestral gene order reconstruction</article-title>
          .
          <source>Master's thesis</source>
          , Comenius University in Bratislava.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Kovácˇ</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>On the complexity of rearrangement problems under the breakpoint distance</article-title>
          .
          <source>Journal of Computational Biology</source>
          ,
          <volume>21</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kovácˇ</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Brejová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and Vinarˇ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>A practical algorithm for ancestral rearrangement reconstruction</article-title>
          .
          <source>In Workshop on Algorithms in Bioinformatics (WABI)</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Moret</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wyman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warnow</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and M.,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>A new implementation and detailed study of breakpoint analysis</article-title>
          .
          <source>In Pacific Symposium on Biocomputing</source>
          , pages
          <fpage>583</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Sankoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Blanchette</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>The median problem for breakpoints in comparative genomics</article-title>
          .
          <source>In Conference on Computing and Combinatorics (COCOON)</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Sankoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cedergren</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lapalme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1976</year>
          ).
          <article-title>Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA</article-title>
          .
          <source>Journal of Molecular Evolution</source>
          ,
          <volume>7</volume>
          :
          <fpage>133</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sankoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Multichromosomal median and halving problems under different genomic distances</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>10</volume>
          :
          <fpage>120</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Valach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farkas</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fricova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brejova</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinar</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfeiffer</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kucsera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomaska</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>B. F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Nosek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Evolution of linear chromosomes and multipartite genomes in yeast mitochondria</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>39</volume>
          (
          <issue>10</issue>
          ):
          <fpage>4202</fpage>
          -
          <lpage>4219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>A. W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Moret</surname>
            ,
            <given-names>B. M. E.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>GASTS: Parsimony scoring under rearrangements</article-title>
          .
          <source>In Workshop on Algorithm in Bioinformatics (WABI)</source>
          , pages
          <fpage>351</fpage>
          -
          <lpage>363</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Yancopoulos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Attie</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Friedberg</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Efficient sorting of genomic permutations by translocation, inversion and block interchange</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>21</volume>
          :
          <fpage>3340</fpage>
          -
          <lpage>3346</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>