Combinatorial Optimization Solutions for the
       Maximum Quartet Consistency Problem

                    António Morgado and Joao Marques-Silva
    School of Electronics and Computer Science, University of Southampton, UK
                     ajrm@soton.ac.uk,jpms@ecs.soton.ac.uk

                                        Abstract
           Given a set of taxa S and a complete set of quartet topologies
       Q over S, the problem of determining a phylogeny that satisfies the
       maximum number of topologies is called the Maximum Quartet Consis-
       tency (MQC) problem. The MQC problem is an NP-problem. MQC
       has been solved both heuristically and exactly. Exact solutions for
       MQC include Constraint Programming, Answer Set Programming and
       Pseudo-Boolean Optimization (PBO). This paper extends the range of
       solutions for MQC, by developing two new PBO models and also by
       introducing models based on the optimization of Satisfiability Modulo
       Theories (SMT). The models were experimentally compared with ex-
       isting exact solutions. The results show that for instances with small
       percentage of quartet errors, the models based on SMT can be com-
       petitive, whereas for instances with higher number of quartet errors
       the PB-models are more efficient.


1     Introduction
Evolutionary trees (or phylogenies) of a given set of taxa (a set of biological
species) are often used for studying the evolutionary relationship between the
given taxa. Current phylogenetic analyses are carried out using biological
data (such as DNA or protein sequences). Since the amount of available
data is different for different species, biologists have to balance the number
of taxa used and the amount of available data common to all the species.
This is know as the Data Disparity problem [1]. In order to overcome the
Data Disparity problem, quartet based methods have been suggested.
    Quartet based methods divide the task of inferring an evolutionary tree
in two subtasks. First, a set of trees for subsets of four taxa is obtained.
Then from the trees obtained a global tree is constructed. Considering only
four taxa in the first step, maximizes the amount of available data used, and
the confidence on the tree produced. These evolutionary trees for subsets
of four taxa are called quartet topologies. The quartet topologies produced
may still be conflicting or even missing. Given a set of quartet topologies, the
problem of computing an evolutionary tree that agrees with the maximum
number of quartet topologies is referred to as the Maximum Quartet Consis-
tency (MQC) problem. In recent years, solutions based on Answer-Set Pro-
gramming (ASP) have been identified as the most efficient [18]. This paper
proposes alternative combinatorial optimization solutions for the MQC prob-
lem. The first set of models extend recent work on using Pseudo-Boolean

Proceedings of the 15th International RCRA workshop (RCRA 2008):
Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion
Udine, Italy, 12–13 December 2008
             a       c                a       b                a       b
             b        d               c        d               d        c
            [a, b|c, d]              [a, c|b, d]              [a, d|b, c]


Figure 1: Graphical representation of the quartet topologies [a, b|c, d],
[a, c|b, d], [a, d|b, c].


Optimization (PBO) [10]. Moreover, the paper also proposes models based
on Satisfiability Modulo Theories (SMT).
    The paper is organized as follows. The next section introduces the MQC
problem and surveys representative solutions. Afterwards, Section 3 summa-
rizes recent work on using PBO for solving the MQC problem, and develops
new PBO models. The use of SMT for MQC is described in Section 4.
Section 5 analyzes experimental results on real and artificially generated
instances. Finally, the paper concludes in Section 6.


2    Maximum Quartet Consistency
This section defines the problem of Maximum Quartet Consistency (MQC),
and reviews existing solutions, either heuristic or exact.
     Consider a given set of taxa S = {s1 , . . . , sn }. A phylogeny on S is a
tree such that each leave is mapped to a taxon in S. Each internal node in
the phylogeny represents an extinct or hypothesized ancestor of the taxa in
the subtrees that the internal node connects. A phylogeny may be rooted or
unrooted. In a rooted phylogeny a link between two nodes represent a parent-
child relation, in which the root of the phylogeny is a common ancestor of
all the taxa. A phylogeny is called binary or resolved if every internal node
has degree three (in a rooted phylogeny the root node has degree two). In
the rest of the paper the term phylogeny refers to an unrooted resolved
phylogeny, unless explicitly stated.
     A subset of four different taxa is called a quartet. A phylogeny for a
quartet is called a quartet topology (simply refered as topology). In this
paper, for a quartet {a, b, c, d} ⊆ S three different topologies are considered
(up to symmetry) denoted by [a, b|c, d], [a, c|b, d] and [a, d|b, c] as represented
in Figure 1. Consider a quartet q ⊆ S and a phylogeny T on S. The topology
obtained from T by removing all nodes not in the paths connecting the nodes
mapped to taxa in q is called the topology of q derived from T . On the left
side of Figure 2, a phylogeny is shown where the paths that connect taxa a,
b, c and f have been marked with dotted lines. In this example the derived
topology for quartet {a, b, c, f } is [a, b|c, f ] (respresented on the right side of
Figure 2).
     A topology qt is said to be consistent with a phylogeny T or T satisfies
qt if the derived topology for the quartet of qt from T is the same as qt. The
topology in Figure 2 is consistent with the phylogeny in the same figure.
     Given a set of topologies Q whose set of taxa is S, Q is said to be complete


                                          2
                   g       f
                                   e
                                                   a      c
                  a            c   d
                       b                           b      f


   Figure 2: Graphical representation of a phylogeny and of a topology.

if Q contains a topology for each of the Cn4 quartets of S; otherwise Q is
said to be incomplete. The set Q is said to be compatible if there exists a
phylogeny T on S that satisfies all the topologies in Q, and T is called the
associated phylogeny of Q. The MQC problem can now be defined as:
Definition 1 (Maximum Quartet Consistency problem (MQC)). Given a
set of taxa S = {s1 , . . . , sn } and a set of topologies Q on S, the Maximum
Quartet Consistency problem is the problem of determining a phylogeny T
on S that satisfies the maximum number of topologies of Q.
     The MQC problem is a NP-hard problem [2]. If Q is complete, then MQC
admits a polynomial-time approximation scheme [8]. If Q is incomplete then
MQC is MAX SNP-hard [8]. Existing approaches for the MQC problem can
be categorized as either heuristic or exact. Heuristic approaches return
an approximation to the optimal phylogeny. Among the many heuristics
proposed over the years, a few are overviewed next. The short quartet
methods of Erdos et al. [5], the use of semi-definite programming by Ben-
Dor et al. [1], the quartet cleaning algorithms of Berry et al. [2]. More
recently, in 2008, Wu et al. [15] proposed three algorithms that return a
phylogeny with a probability of success.
     This paper focuses on exact solutions for the MQC problem, which guar-
antee that the phylogeny returned satisfies the maximum number of topolo-
gies. Existing exact approaches can be categorized in four main classes.
Bendor et al. [1] proposed the use of Dynamic Programming to solve the
problem of MQC where the topologies given have weights associated to them.
Their objective is to compute a phylogeny with a maximal score, that is, a
phylogeny whose sum of weights of the satisfied topologies is maximal.
     The Fixed-Parameter Algorithm was proposed by Gramm and Nieder-
meier [7]. The objective of the algorithm is to compute a phylogeny that
fails to satisfy no more than a given k topologies. Their algorithm works in
time O(4k n + n4 ), where k is the given number of errors and n = |S|.
     Wu et al. [16] proposed the use of Answer Set Programming (ASP) for
MQC. The proposed approach searches for an ultrametric matrix that sat-
isfies the maximum number of topologies. From the ultrametric matrix it is
possible to obtain a phylogeny in polynomial time.
     More recently, in 2005, Wu et al. [17] presented a lookahead branch-and-
bound algorithm for MQC. The algorithm can be seen as an improvement
over the Fixed-Parameter Algorithm [7], and has the same running time of
O(4k n + n4 ). However, in contrast with the Fixed-Parameter Algorithm,
the number of quartet errors is not required to be known.

                                       3
   Finally, in 2008, Morgado and Marques-Silva [10] translated the problem
of MQC to Pseudo-Boolean Optimization (PBO) .


3     Pseudo-Boolean Optimization Models for MQC
Pseudo-Boolean Optimization (PBO) models for MQC were first proposed
in [10]. This section reviews in some detail the PBO models followed by the
description of two new PBO models based on the existing models.

3.1     Existing Pseudo Boolean Optimization Models
Given a complete set of topologies Q over a set of taxa S, the objective of
the PBO models for MQC is to encode a phylogeny T of S that satisfies the
maximum number of topologies in Q. In order to obtain T , and similarly
to the use of ASP, the PBO models encode an ultrametric matrix M . From
the ultrametric matrix M it is possible to obtain a rooted phylogeny T .
    The ultrametric matrix M is a square symmetric matrix n × n (where
n = |S|) whose entry values are labels that range from 0 to n − 1. Each
row and column of the ultrametric matrix M is indexed by the given taxa,
and the value of an entry M (i, j) is the label of the internal node in T
corresponding to the lowest common ancestor between leaf nodes si and sj .
The labels of the internal nodes of T have the property that for any path
from root node to a leaf, the corresponding labels are in decreasing order.
To guarantee that M is an ultrametric matrix, M must satisfy the following
three properties [16]: (a) all values in the matrix must be between 1 and
n, except for the elements on the main diagonal which must be 0, that is
M (i, i) = 0; (b) M is symmetric, thus M (i, j) = M (j, i); and (c) for each
triple (i, j, l) such that 1 ≤ i, j, l ≤ n then

                M (i, j) = M (i, l) ∧ M (i, l) > M (j, l) ∨               (1)
                M (i, j) = M (j, l) ∧ M (j, l) > M (i, l) ∨               (2)
                M (j, l) = M (i, l) ∧ M (i, l) > M (i, j)                 (3)

Wu et al. [18] proved that in order to obtain an optimal phylogeny, the
values of the entries of M can be restricted to M (i, j) ≤ ⌈ n2 ⌉.
    The PBO models can be divided in three main parts: encoding of the
ultrametric matrix; encoding of the consistency of the topologies; and en-
coding of the cost function.

3.1.1    Basic PBO Model
We start by detailing the Basic PBO model. To encode the ultrametric
matrix, the Basic PBO model considers the boolean variables Mi,j,k where
1 ≤ i < j ≤ n and 1 ≤ k ≤ ⌈ n2 ⌉. The semantics of variable Mi,j,k is
Mi,j,k = 1 if M (i, j) = k, otherwise Mi,j,k = 0. For a given pair (i, j) one


                                     4
and only one k is allowed to have Mi,j,k = 1. To ensure this condition, the
Basic PBO model introduces the following constraints:
                                  ⌈2⌉
                                   n
                                  X
                                        Mi,j,k = 1                              (4)
                                  k=1
                                                             P⌈ n2 ⌉
The value of an entry M (i, j) is given by M (i, j) = k=1            k × Mi,j,k . To
ensure that the resulting matrix M is ultrametric, the Basic PBO model has
to guarantee that one of the conditions (1), (2) or (3) is satisfied. The Basic
PBO model associates conditions (1), (2) and (3) with new boolean variables
c1i,j,l , c2i,j,l and c3i,j,l . Variable cxi,j,l has value 1 if the corresponding
condition x is satisfied, otherwise has value 0. To force one of the variables
to have value 1 the Basic PBO model introduces following constraints:
                           c1i,j,l + c2i,j,l + c3i,j,l ≥ 1                      (5)
    Consider variable c1i,j,l and the corresponding condition (1). Condition
(1) is a logical AND of two conditions (an equality condition and a greater
than condition). To encode the value of variable c1i,j,l , the Basic PBO model
                                           (=)       (>)
introduces two new boolean variables c1i,j,l and c1i,j,l and associates them
with the equality condition and the greater than condition, respectively.
c1i,j,l is encoded as an AN D gate that is:
                                                   (=)   (>)
                          c1i,j,l ≡ AN D(c1i,j,l , c1i,j,l )                    (6)
                                 (=)             (>)
    The values of variables c1i,j,l and c1i,j,l are encoded as comparator cir-
cuits on the unary representation of M (i, j), M (i, l) and M (j, l) (that is,
using variables Mi,j,k , Mi,l,k and Mj,l,k ). Variables c2i,j,l and c3i,j,l are en-
coded in an analogous way.
    According to [16], an ultrametric matrix M satisfies a topology [i, j|l, m]
if and only if one of the following conditions is satisfied:
                (M (i, l) > M (i, j) ∧ M (j, m) > M (i, j)) or                  (7)
               (M (i, l) > M (l, m) ∧ M (j, m) > M (l, m))                      (8)
    The Basic PBO model considers a new boolean variable qt that encodes
the consistency of the t-th topology. Suppose that the t-th topology in Q is
topology [i, j|l, m]. The model encodes the value of qt by introducing two
new boolean variables q1t , q2t and associates them with conditions (7) and
(8), respectively. The value of qt is then encoded as an OR gate:
                                qt ≡ OR(q1t , q2t )                             (9)
The values of variables q1t and q2t are encoded in a similar way as variables
c1i,j,l . Finally, the cost function of the Basic PBO model is to maximize the
number of variables qt with value 1:
                                             |Q|
                                             X
                                       max         qt                          (10)
                                             t=1


                                             5
3.1.2    Reduced PBO model
The Reduced PBO model is based on the Basic PBO model, but where
auxiliary variables are used that allow a simpler encoding of greater than
conditions (M (i, j) > M (l, m)) using fewer variables. In the Reduced model,
constraint (4) is split into two conditions:
                    ⌈2⌉
                     n
                                             ⌈2⌉
                                               n
                    X                        X
                          Mi,j,k ≥ 1,               Mi,j,k ≤ 1              (11)
                    k=1                       k=1

    The first condition is in PB format and so can be added to the model,
whereas the second condition is encoded using sequential counters by Sinz [14].
These counters introduce auxiliary variables si,j  k with the semantics that if
M (i, j) = a, then for all 1 ≤ k < a, sk = 0 and for a ≤ k ≤ ⌈ n2 ⌉, si,j
                                       i,j
                                                                         k = 1.
    Instead of using comparator circuits for encoding constraints of the form
M (i, j) > M (l, m) as in the Basic PBO model, the Reduced PBO model
uses variables si,j        l,m
                 k and sk . The constraint M (i, j) > M (l, m) is satisfied
if and only if there is a k′ such that si,j            l,m
                                          k ′ = 0 and sk ′ = 1. The constraint
is encoded using for each k an AND gate of N OT (si,j             l,m
                                                         k ) and sk . The final
output is encoded as an OR gate of the AND gates. Moreover, all the
constraints of the Basic PBO model are maintained in the Reduced PBO
model (except for constraint (4)), but using the new variables si,j  k .


3.1.3    Binary PBO model
The last PBO model proposed in [10] was the Binary PBO model. The
rationale of the Binary PBO model is to replace the encoding of M (i, j)
by changing the semantics of variables Mi,j,k . In the Binary PBO model,
variable Mi,j,k represents the k-th bit of the binary representation of M (i, j).
As a consequence the limits of index k are reduced to 0 ≤ k ≤ ⌊log2 (⌈ n2 ⌉)⌋,
therefore reducing the number of variables.
                                                        P⌊log (⌈ n ⌉)⌋
    Variables M (i, j) are now encoded as M (i, j) = k=02 2 2k × Mi,j,k .
Moreover, all the constraints of the Basic PBO model are maintained (using
the new limits of index k) except for constraints (4) which are replaced by
             P⌊log (⌈ n ⌉)⌋
constraints ( k=02 2 Mi,j,k ≥ 1) and a comparator circuit to represent
     P⌊log (⌈ n ⌉)⌋
that k=02 2 Mi,j,k ≤ ⌈ n2 ⌉.

3.2     New Pseudo Boolean Optimization Models
This section proposes two new PBO models, namely the Unary PBO model
and the Binary PBO model with a dedicated comparator.

3.2.1    Unary PBO Model
The rationale of the Unary model is to change the semantics of the selection
variables Mi,j,k and merge them with the semantics of the auxiliary variables

                                        6
                               k
                          1 2 3 4 5 6
                   Mi,j,k 0 0 0 1 0 0                           k
                                                           1 2 3 4 5 6
                     si,j
                      k   0 0 0 1 1 1               si,j
                                                     k     1 1 1 1 0 0
                                (a)                             (b)


Figure 3: Example of a valuation for selection and auxiliary variables when
M (i, j) = 4 and n = 12. (a) Represents a valuation as in the Reduced
model. (b) Represents a valuation as in the Unary model.


si,j
 k . The objective is for the new variables to be able to represent the
values in the ultrametric matrix and still allow the simplified encoding of
M (i, j) < M (l, m).
     Consider for example that n = 12, then k ≤ ⌈ n2 ⌉ = 6. For some position
(i, j) in the matrix, suppose that M (i, j) = 4. A valid valuation of the
selection variables and of the auxiliary variables is given in Figure 3 (a).
     The Unary model uses only the auxiliary variables si,j     k , but with a new
semantics, which is if M (i, j) = a for same a, 1 ≤ a ≤ ⌈ n2 ⌉, for all 1 ≤ k ≤ a,
then si,j                                        n         i,j
        k has value 1, and for all 1 < k ≤ ⌈ 2 ⌉, then sk has value 0.
     M (i, j) can then be obtained by:
             (
                P⌈ n2 ⌉−1 i,j      i,j            P⌈ n2 ⌉−1 i,j     i,j
M (i, j) =          k=1  [(s k − s k+1 ) × k] if ( k=1 [(sk − sk+1 ) × k]) 6= 0
                ⌈ n2 ⌉                        otherwise

For the previous example, the new valuation is represented in Figure 3 (b).
Given the new semantics, condition M (i, j) > M (l, m) is true if there is a k′
such that si,j              l,m
            k ′ = 1 and sk ′ = 0. The Reduced model considers new boolean
           (i,j)(l,m)                                      (i,j)(l,m)      (i,j)(l,m)
variables ek          The semantics of variables ek                   is ek           has value
      i,j           l,m                      (i,j)(l,m)                                 (i,j)(l,m)
1 if sk = 1 and sk = 0, otherwise ek                     has value 0. Variables ek
                                  (i,j)(l,m)
are encoded as AND gates, ek                  ≡ AN D(si,j                 l,m
                                                            k , N OT (sk )).
     The greater than condition M (i, j) > M (l, m) is then encoded by a new
                                                                                       (i,j)(l,m)
variable GT(i,j)(l,m) whose value is given as an OR gate of variables ek                          ,
that is:
                                            (i,j)(l,m)                k
                    GT(i,j)(l,m) ≡ OR(ek                : 1 ≤ k ≤ ⌈ ⌉)
                                                                       2
     As with the Reduced model, the cost function and all the constraints of
ultrametricity of matrix M and of quartet consistency of the PB model are
maintained, but making use of variables GT(i,j)(l,m) as appropriate.

3.2.2     Binary PBO Model with Specialized Comparator
Sinz [14] presented a specialized comparator circuit to represent that a bi-
nary word (bm bm−1 . . . b0 ) is smaller or equal to a number n. An optimization


                                                7
of the Binary PBO model corresponds to use the comparator circuit pro-
posed by [14] for encoding that M (i, j) ≤ ⌈ n2 ⌉. Suppose ⌈ n2 ⌉ is given by the
binary word (bm bm−1 . . . b0 ). The recursive equations of [14] (that obtain a
set of clauses) can be modified to obtain a set of sums:
                                              
             {1 − Mi,j,0 }, b0 = 0              {1 − Mi,j,r } ∪ Cp(br−1 ) , br = 0
Cp(b0 ) =                          , Cp(br ) =
             ∅            , b0 = 1              {1 − Mi,j,r } ⊗ Cp(br−1 ), br = 1

where A ⊗ B = {x + y|x ∈ A, y ∈ B}. Then the set of conditions to add to
the model is the set {s ≥ 1|s ∈ Cp(bm )}.
   All the constraints of the Binary PBO model are maintained except for
the comparator circuit that encodes the restriction M (i, j) ≤ ⌈ n2 ⌉, which is
now encoded according to the above equations.


4     Satisfiability Modulo Theories for MQC
This section outlines the use of Satisfiability Modulo Theories (SMT) for
solving the MQC problem, by describing how to optimize a linear cost func-
tion subject to a SMT formula. This solution is referred to as the maximum
Satisfiability Modulo Quartet Consistency (mSMQC) model. The mSMQC
model follows the same principles of the PBO models of Section 3, but using
the modelling capabilities of SMT. Thus the mSMQC model encodes an ul-
trametric matrix M and checks for the consistency of the topologies given,
against the values in M .
    The mSMQC model can be divided in four parts:(1) Domain Restriction,
defines the variables to be used and limits the domains of the variables; (2)
Ultrametric Constraints, enforces the matrix to be ultrametric; (3) Quartet
Consistency, encodes the consistency of the topologies given; (4) Bounding,
maximizes the number of satisfied topologies. The first three define the
problem of determining the consistency of a set of topologies as a SMT-
instance. The last seeks to maximize the number of satisfied topologies.

4.1   Quartet Consistency as an Instance of SMT
Domain Restriction. Consider a given set of topologies Q over the set of
taxa S = {s1 , . . . , sn }. In the PB-models the solution to the MQC problem
was given by encoding the resulting phylogeny with an ultrametric matrix
M . In the mSMQC model, each position of the matrix is associated with
one integer variable containing the value in that position. For each pair of
taxa (si , sj ) ∈ S, such that 1 ≤ i < j ≤ n, the mSMQC model considers
the integer variable Mi,j . Variable Mi,j has value k with 1 ≤ k ≤ ⌈ n2 ⌉ if and
only if position (i, j) of M has value k.
    To constrain the domain of variables Mi,j (1 ≤ i < j ≤ n), the mSMQC
model introduces the following constraints:
                                                n
                         ((Mi,j ≥ 1) ∧ (Mi,j ≤ ⌈ ⌉))                         (12)
                                                2

                                        8
Ultrametric Constraints. Conditions (1), (2) and (3) in page 4 guarantee
the ultrametricity of a matrix. These conditions are guaranteed by the
mSMQC model by, for each triple of taxa si , sj , sl ∈ S, such that 1 ≤ i <
j < l ≤ n, adding the constraints:

  [(Mi,j = Mi,l ) ∧ (Mi,l > Mj,l )] ∨ [(Mi,j = Mj,l ) ∧ (Mj,l > Mi,l )] ∨
                                                                              (13)
                     [(Mi,l = Mj,l ) ∧ (Mj,l > Mi,j )]

Quartet Consistency. The consistency of each topology given is associ-
ated with a new integer variable in the model. Suppose the t-th topology in
Q is of the form [si , sj |sl , sm ]. A new variable qt is created and qt will have
value 1 if the topology is satisfied by the current solution, qt will value 0
otherwise. To constraint the value of qt , the mSMQC uses the ite operator,
and adds the following constraints:

 ite[(((Mi,l > Mi,j ) ∧ (Mj,m > Mi,j )) ∨ ((Mi,l > Ml,m ) ∧ (Mj,m > Ml,m ))),
     (qt = 1), (qt = 0)]
                                                                          (14)
    All the constraints added to the mSMQC model are constraints of Linear
Integer Arithmetic (LIA).

4.2     Maximization Problems in SMT
For the PBO models of Section 3, the maximization was achieved through
a cost function. SMT is a decision problem and as such, there is no cost
function to be optimized. Nevertheless, recent work addressed optimization
problems in SMT, including weighted MAX-SMT [11]. In the following we
consider how to maximize the number of topologies satisfied by either using
bounds or MAX-SMT.

4.2.1    Maximization by Bounding Bottom-Up
Let nsqt be the number of satisfied topologies. Thus 1 ≤ nsqt ≤ |Q|, that
is at least one topology is satisfied and at most all topologies are satisfied.
    The motivation for Bounding by Bottom-Up is to search for the maxi-
mum number of satisfied topologies iteratively, by creating instances of SMT
and giving them to a SMT solver.
    In the Bounding by Bottom-Up approach a lower bound lb nsqt for nsqt
is maintained. This lower bound has the property that is always possible to
satisfy at least lb nsqt topologies. The initial value of lb nsqt is 1. In each
iteration, a SMT instance is created as described in Section 4.1 to which is
added the constraint:
                                |Q|
                               X
                              (     qt > lb nsqt)                          (15)
                                r=1

This instance is given to the SMT solver and if it returns SATISFIABLE,
then the lower bound is updated and a new iteration starts with the new


                                        9
lower bound. If the SMT solver returns UNSATISFIABLE then mSMQC
returns and the maximum number of satisfied topologies is lb nsqt.
    The way the lower bound is updated depends on the SMT solver used. If
the SMT solver is capable of reporting an assignment to the variables every
time time it returns SATISFIABLE, then the new lower bound is given by
the number of topologies satisfied by the assignment reported. Otherwise,
the lower bound can only be increased by one.

4.2.2   Maximization by Bounding Top-Down
The idea of the Bounding by Top-Down is the opposite of the previous
Bounding by Bottom-Up. Instead of iterating through satisfiable SMT in-
stances, the Top-Down approach will iterate through a set of unsatisfiable
SMT instances till a satisfiable instance is found.
   The Bounding by Top-Down maintains an upper bound up nsqt for nsqt.
This upper bound up nsqt has the property that is impossible to satisfy
(up nsqt + 1) topologies. The initial value of up nsqt is the total number of
topologies |Q|. In each iteration the constraint is added to the model:
                              |Q|
                             X
                            (     qt ≥ up nsqt)                          (16)
                             r=1

The instance is given to a SMT solver, and if the SMT solver returns UN-
SATISFIABLE then up nsqt is updated to (up nsqt−1), and a new iteration
is started. If the SMT solver returns SATISFIABLE then mSMQC returns
with a maximum number of satisfied topologies of up nsqt.

4.2.3   Maximization by a Mixed Bounding
The idea of the Mixed Bounding approach is to search for the maximum
number of satisfied topologies using binary search and augment it with the
Bounding by Bottom-Up approach.
    The Mixed Bounding approach maintains both the lower bound lb nsqt
and the upper bound up nsqt (as in the previous bounding approaches).
Consider max nsqt to be the exact maximum number of topologies satisfied.
At each iteration, both the following conditions are true: (1) max nsqt ≥
lb nsqt: and (2) max nsqt < up nsqt. When lb nsqt + 1 = up nsqt then
max nsqt has been found to be lb nsqt.
    In each iteration the mSMQC with the Mixed Bounding approach consid-
ers the value in the middle of the bounds. Let mdl nsqt = ⌊ up nsqt+lb
                                                                   2
                                                                       nsqt
                                                                            ⌋.
If mdl nsqt = lb nsqt then max nsqt is lb nsqt and mSMQC returns. Oth-
erwise mSMQC with Mixed Bounding adds the constraints:
              |Q|                              |Q|
             X                                X
            (     qt ≥ mdl nsqt),            (     qt < up nsqt)         (17)
             r=1                              r=1


                                     10
If the SMT solver returns UNSATISFIABLE, then the upper bound is up-
dated to the value in the middle (up nsqt = mdl nsqt) and a new iteration
starts. If the SMT solver returns SATISFIABLE, then the lower bound is
updated and a new iteration starts with the new lower bound. As in the
Bounding by Bottom-Up approach, the way the lower bound is updated de-
pends on the SMT solver used. If the SMT solver is capable of returning
satisfying assignments on satisfiable instances, then the lower bound is up-
dated to the number of satisfied topologies by the assignment reported. If
the SMT solver is unable to report satisfying assignments, then the lower
bound is updated to the value in the middle, that is, lb nsqt = mdl nsqt.

4.2.4        Maximization as a MAX-SMT Instance
In MAX-SMT the overall formula is an unsatisfied conjunction of SMT sub-
formulas. Each sub-formula is associated with a cost of unsatisfying the
sub-formula. The objective is to determine an assignment that minimizes
the sum of the costs of the unsatisfied sub-formulas. For example, consider
the formula ϕ ≡ (x < 10) ∧ (y > 20) ∧ (x − y > 40) ∧ (y < 30),
where subformulas (x − y > 40) and (y < 30) have costs 3 and 10, respec-
tively. Note that not all sub-formulas need to have a cost. The sub-formulas
without a cost have to be satisfied, and are called hard sub-formulas. The
sub-formulas with a cost (which may or may not be satisfied) are called soft
sub-formulas. One MAX-SMT solution for ϕ is x = 5, y = 25 with cost 3.
    In the mSMQC model, all the Domain Restriction constraints, Ultra-
metric constraints and Quartet Consistency constraints have to be satisfied
(hard formulas). What may or may not be satisfied is that the variable
that encodes the consistency of the topology has value 1. Thus, the soft
sub-formula for the t-th topology is:
                                        (qt = 1)                          (18)
The cost of unsatisfying the sub-formula is set to one. The MAX-SMT solver
will try to minimize the cost of unsatisfying the topologies.


5        Experimental Results
This section analyzes the results obtained from running the existing ASP
solution [18], the PBO models described in Section 3, and the SMT-based
solutions proposed in Section 4. First, the experimental setup is described,
followed by the analysis of the experimental results for two sets of instances.
    The first set of instances were obtained from [18] and correspond to a
set of artificially generated instances. Each instance has 10 taxa and is
associated with a percentage of modified topologies (to account for errors).
The percentages considered were 1%, 5%, 10%, 15%, 20%, 25%, and 30%.
    The second set of instances was generated from a real example with 52
taxa1 . This example corresponds to a complete set of topologies inferred
    1
        This example was kindly provided by Dr. Guohui Lin.


                                            11
                           1%      5%      10%      15%      20%       25%       30%
   phy smodels            0.37   1.35     3.98     8.79     19.20     33.71     65.48
   phy clasp              3.43    5.32    12.61    23.42    57.25    129.63    275.65
   PBO-basic bsolo        5.20   23.24   134.14   339.01   863.75   2147.83   3371.67
   PBO-basic minisat+    14.04   14.77    29.03    53.15    69.60    119.50    244.12
   PBO-basic pueblo       6.89   13.46    31.56    81.06   259.58    497.87   1083.73
   PBO-reduce bsolo       2.00   17.94    67.68   255.80   908.31   2346.07   2605.87
   PBO-reduce minisat+    0.78    2.24     5.82    12.74    32.73     61.71    123.20
   PBO-reduce pueblo     61.77   90.24   141.93   242.97   343.44    742.47   1200.77
   PBO-unary bsolo        4.35   13.17    69.94   303.90   826.14   1906.28   3323.00
   PBO-unary minisat+     1.04    2.74     5.91    12.65    25.96     56.15    133.89
   PBO-unary pueblo       1.00    6.71    18.84    63.10   166.39    443.57   1144.69
   PBO-bin bsolo          2.66   15.17    37.67   140.28   397.74    746.46   2070.61
   PBO-bin minisat+       2.30    2.89     5.94    11.91    17.96    28.27     53.11
   PBO-bin pueblo         4.98   11.52    28.50    56.68   127.52    191.49    374.43
   PBO-bin-sc bsolo       4.91   16.17   102.70   163.43   363.35    759.69   1847.96
   PBO-bin-sc minisat+    1.96    3.71     5.45     9.57   16.27      29.36     58.88
   PBO-bin-sc pueblo      2.14   14.63    34.31    63.20    86.40    201.02    383.69
   SMT bottom-up         34.48   37.63    80.18   101.16   139.93    187.72    339.90
   SMT top-down           1.51    6.99    24.43    72.31   215.40    565.97   1382.56
   SMT mixed             10.77   15.63    23.43    46.09   101.06    171.83    470.86
   SMT max-smt           0.06     1.57     4.59    10.66    29.02     52.65     88.85


Table 1: Average running times in seconds for the first set of instances
(artificial - random generated) by percentage of modified topologies.


from a set of CCV distances. In order to compare the different models, the
instance was divided in 10 sets of 10 taxa each. For a considered set of taxa,
all the topologies on that set of taxa were gathered to form a new instance.
To introduce a higher number of errors, the same idea of modifying a number
of random topologies was used. The same percentages were used.
    The encoders/solvers used in the experiments can be divided in 3 cate-
gories, phy+ASP-solver, PBO-encoder+PBO-solver and SMT-based solver.
    phy+ASP-solver. phy is an encoder of the MQC problem into ASP,
which was obtained from [18]. The approach of using phy with an ASP-
solver is currently one of the best approaches in the literature. The ASP
encoder used in the experiments is derived from the encoder phy [18]. phy
receives as arguments, the number of taxa, the maximum number of allowed
quartet errors and the instance with the topologies. In the experiments the
maximum number of allowed errors was set to the total number of topologies.
    Two ASP-solvers were considered for the experiments, SMODELS [13]
and CLASP [6]. Other ASP-solvers were considered, but these two were
the only that allowed maximizing with phy. The maximization was done by
setting the ASP-solvers to enumerate all the stable models.
    PBO-encoder+PBO-solver. The PBO-encoder + PBO-solver is char-
acterized by first using one of the encoders described in Section 3 to obtain
a pbo file. The pbo file is then given to a pbo-solver. Three different PBO-
solvers were used in the experiments, the bsolo solver [9], the minisat+
solver [4] and the pueblo solver [12].
    SMT-based solver. The models based in SMT described in Section 4
require interaction with the SMT-solver. The SMT-solver used in the exper-


                                           12
                      0%      1%      5%       10%      15%       20%       25%       30%
phy smodels          1.24    1.44    2.50      7.00    13.40     23.59     44.03     69.50
phy clasp            1.99    2.26    4.11      9.17    17.25     45.58    105.82    242.34
PBO-basic bsolo      7.32    8.37   36.24    179.70   460.27    972.40   2473.14   3031.09
PBO-basic minisat+   9.12    4.89   17.48     23.95    56.37     61.42    110.26    177.85
PBO-basic pueblo     3.88    3.22   12.37     44.55   101.41    236.38    521.18   1252.61
PBO-reduce bsolo     1.70    2.90   18.34     84.74   262.27   1031.08   2718.71   3489.77
PBO-reduce minisat+ 1.48     1.65    1.98     4.43      9.89     18.60     39.04     73.86
PBO-reduce pueblo   93.21   82.51   88.10    167.56   245.20    379.72    638.59   1396.20
PBO-unary bsolo      2.81    3.88   23.47     95.88   372.25    788.38   2146.68   3162.33
PBO-unary minisat+   1.20    1.58    3.96      6.00     9.94     20.73     39.57     89.02
PBO-unary pueblo     2.05    1.49    4.44     20.65    55.54    168.44    544.00   1129.22
PBO-bin bsolo        3.09    6.01   16.61     85.99   171.04    352.50    892.75   1594.01
PBO-bin minisat+     2.17    2.19    4.13      5.84    10.09     16.98     29.50     46.17
PBO-bin pueblo       3.12    7.70    9.43     22.05    55.76     93.14    183.08    418.72
PBO-bin-sc bsolo     4.78    3.45   14.65     68.95   175.88    345.95    905.19   1823.66
PBO-bin-sc minisat+  2.89    2.14    5.04      7.55    7.84     15.60     22.54     42.97
PBO-bin-sc pueblo    2.57    3.69   17.04     26.36    74.35    120.68    209.84    421.12
SMT bottom-up       34.20   21.64   36.28     58.15   102.37    146.85    242.39    320.84
SMT top-down         0.46    1.38    6.84     24.94    77.48    231.31    580.47   1395.21
SMT mixed            0.46    9.58   13.32     18.93    47.50     99.73    201.33    436.74
SMT max-smt         0.04    0.05    1.48       4.44    10.91     27.91     51.90     86.17


Table 2: Average running times in seconds for the second set of instances
(real instances).


iments was the Yices solver [3]. Yices offers a C API that allows the inter-
action with the solver, where it is possible to ask the solver for a satisfying
assignment to the variables when the solver declares that the smt-instance
is satisfiable. Yices is also able to solve max-smt instances.
    The results were obtained in a set of nine Intel Xeon 5160, 3GHz dual
core servers with 4GB of RAM. The timeout was set to 3600 seconds. Ta-
ble 1 presents the results for the first set of instances. The percentages of
modified topologies are shown in the top of the columns, and the encoders
and solvers used are indicated on the left column. In terms of timeouts,
the only timeouts were for instances with 20% or 30%. For 20%, the pbo-
binary model has timed-out with bsolo and pueblo in one instance, and the
pbo-bin-sc model with pueblo has timed-out in one instance. For 30%, the
pbo-basic has timed-out seven times with bsolo and once with pueblo. The
pbo-reduce with bsolo timeout nine times and pbo-unary has timed out five
times with bsolo and once with pueblo.
    FromTable 1 and omparing the results of phy with both the ASP-solvers,
we can conclude that smodels is on average four times faster than clasp for
these instances. Comparing the results among the PBO-models, minisat+
stands out as being the fastest in these instances. The difference among the
PB-encoders is not as clear. For instances with percentage of 5% or smaller,
the pbo-reduced is the fastest model. On the other hand on instances with
percentage of 10% or higher, then the binary models are faster. For the
SMT-based approaches, the max-smt approach is the most efficient.
    By analyzing all the models and solvers, we can conclude that for in-
stances with percentage of 1% smt max-smt is faster than any other solver.


                                            13
For instances with percentage of 5%, 10% or 15% the phy + smodels ap-
proach has better performance. For instances with 20% or higher the pbo-
binary and pbo-binary-sc with minisat+ present better results.
    Table 2 shows for the second set of instances the average running time of
the instances per encoder/solver and percentage of modified topologies. The
timeouts for these instances occurred for 1%, 25% and 30%. The pbo-binary
model with bsolo timeout once for 1%. The pbo-basic with bsolo timed-out
once for 25% and nine times for 30%. The pbo-unary with bsolo timeout
once for 25% and seven times for 30%. The pbo-reduced with bsolo timeout
nine times for 30%. For instances that have 0%, 1% or 5%, smt max-smt is
the solver with best performance results, whereas if the instance has 10% of
modified topologies, the pbo-reduced encoder with minisat+ is the fastest.
For instances with 15% or more, the pbo-binary-sc encoder with minisat+
presents the best performance results.

6    Conclusions
This paper proposes new approaches for solving the Maximum Quartet Con-
sistency (MQC) problem. Building on recent work on using Pseudo-Boolean
Optimization (PBO) models for MQC [10], a number of new alternative PBO
models is proposed. In addition, the paper shows how maximization vari-
ants of Satisfiability Modulo Theories (SMT) can be used for solving the
MQC problem. To our best knowledge, this is the first concrete example of
using SMT for solving computational problems in bioinformatics.
    The experimental results suggest that the SMT models are the most ad-
equate for instances with a small percentage of modified quartet topologies,
whereas for instances with a larger percentage of modified quartet topologies
the most effective solution is currently based on PBO using the Minisat+
solver. The results in the paper provide evidence that there are a number
of effective alternative approaches for solving the MQC problem, including
ASP, PBO and SMT.
    Future research directions will include better integration of SMT solvers
in addressing the MQC problem.

Acknowledgments
This work was supported in part by Microsoft Research through its PhD Scholarship
Programme.


References
 [1] A. Ben-Dor, B. Chor, D. Graur, O. R., and D. Pelleg. From four-taxon trees
     to phylogenies (preliminary report): the case of mammalian evolution. In Int.
     Conf. on Computational Molecular Biology (Recomb98), pages 9–19, 1998.
 [2] V. Berry, T. Jiang, P. Kearney, M. Li, and T. Wareham. Quartet cleaning:
     Improved algorithms and simulations. European Symposium on Algorithms,
     1643:313–324, 1999.

                                       14
 [3] B. Dutertre and L. de Moura. The yices smt solver. Technical report, Computer
     Science Laboratory, SRI International, available at http://yices.csl.sri.com.
 [4] N. Eén and N. Sorensson. Translating pseudo-boolean constraints into sat.
     Journal of Satisfiability, Boolean Modeling and Computation, 2:1–26, 2006.
     Available at http://minisat.se/MiniSat+.html.
 [5] P. Erdos, M. Steel, L. Szekely, and T. Warnow. Constructing big trees from
     short sequences. Int. Colloquium on Automata, Languages, and Programming,
     1997.
 [6] M. Gebser, B. Kaufmann, A. Neumann, and T. Schaub. clasp : A conflict-
     driven answer set solver. In Int. Conf. on Logic Programming and Nonmono-
     tonic Reasoning (LPNMR2007), volume 4483, pages 260–265, 2007. Available
     at http://www.cs.uni-postdam.de/clasp.
 [7] J. Gramm and R. Niedermeier. A fixed-parameter algorithm for minimum
     quartet inconsistency. Computer and System Sciences, 67(4):723–741, Decem-
     ber 2003.
 [8] T. Jiang, P. Kearney, and M. Li. Orchestrating quartets: approximation and
     data correction. Symp. on Foundations of Computer Science, pages 416–425,
     1998.
 [9] V. Manquinho and J. P. Marques-Silva. Satisfiability-based algorithms for
     boolean optimization. Annals of Mathematics and Artificial Intelligence,
     (40):3–4, 2004. Available at http://sat.inesc-id.pt/bsolo/.
[10] A. Morgado and J. Marques-Silva. A pseudo-boolean solution to the maxi-
     mum quartet consistency problem. WCB08 - Workshop on Constraint Based
     Methods for Bioinformatics, May 2008.
[11] R. Nieuwenhuis and A. Oliveras. On SAT Modulo Theories and Optimization
     Problems. In Theory and Applications of Satisfiability Testing, pages 156–169,
     2006.
[12] H. Sheini and K. Sakallah. Pueblo: A modern pseudo-boolean sat solver. In
     Design, Automation and Test in Europe (DATE’05), volume 2, pages 684 –
     685, 2005.
[13] P. Simons.      Computing the stable model semantics. Available at
     http://www.tcs.hut.fi/software/smodels/.
[14] C. Sinz. Towards an optimal cnf encoding of boolean cardinality constraints.
     In Principles and Practice of Constraint Programming, pages 827–831, 2005.
[15] G. Wu, M. Kao, G. Lin, and J. You. Reconstructing phylogenies from noisy
     quartets in polynomial time with a high success probability. Algorithms for
     Molecular Biology, 3(1), 2008.
[16] G. Wu, G. Lin, and J. You. Quartet based phylogeny reconstruction with an-
     swer set programming. Int. Conf. on Tools with Artificial Intelligence, 00:612–
     619, 2004.
[17] G. Wu, J. You, and G. Lin. A Lookahead Branch-and-Bound Algorithm for
     the Maximum Quartet Consistency Problem. Workshop on Algorithms in
     Bioinformatics, pages 65–76, October 2005.
[18] G. Wu, J. You, and G. Lin. Quartet-based phylogeny reconstruction with
     answer set programming. IEEE/ACM Trans. Comput. Biol. Bioinformatics,
     4(1):139–152, 2007.


                                        15