Il Milione: A Journey in the Computational Logic in Italy


Recenti progressi basati su Programmazione Logica e/o a Vincoli
               sulla soluzione del protein folding
  Recent Constraint/Logic Programming based advances in the
            solution of the Protein Folding Problem
                                Agostino Dovier


                                           1
                           Il Milione: A Journey in the Computational Logic in Italy


SOMMARIO/ABSTRACT                                                  in 2006, and with CPAIOR in 2008 (see, e.g., http:
                                                                   //wcb08.dimi.uniud.it); we have organized the
In questo articolo desidero illustrare il contributo del mio       International Summer Schools BCI (Biology, Communi-
gruppo di ricerca alla disciplina Bioinformatica, con par-         cation, and Information) in Dobbiaco and Trieste (see,
ticolare riferimento alla risoluzione del problema della           e.g., http://bioinf.dimi.uniud.it/bci2006;
predizione di struttura di una proteina usando metodologie         and we have been guest editors of a special issue of the
di programmazione logica e a vincoli.                              journal Constraints on these topics [17].
In this paper, we summarize the contribution to Bioinfor-             As far as the technical contribution is concerned, we
matics of our research group. In particular, we will present       have worked on the Protein Structure Prediction problem
our approach to the solution of the protein structure pre-         using, whenever possible, techniques coming from logic
diction problem based on constraint/logic programming              programming and constraint programming. In the rest of
techniques.                                                        this paper we briefly introduce this challenging problem
                                                                   and give an overview of our results.
Keywords: Logic Programming, Constraint Program-
ming, Bioinformatics                                               2    The Protein Structure Prediction Problem

1   Introduction                                                   The Primary structure of a protein is a linked sequence
                                                                   of aminoacids. There are 20 kinds of aminoacids, iden-
In the last years we have witnessed the birth and the rapid        tified by a letter. For the scope of this paper, the pri-
growth of a new research area whose results have a posi-           mary structure of a protein is a string s1 · · · sn with si ∈
tive impact on traditional and fundamental disciplines such        {A, . . . , Z} \ {B, J, O, U, X, Z}.
as biology, chemistry, physics, medicine, agriculture, or              The Tertiary Structure (native state) of the protein is a
industry (briefly denoted globally as “Bio”). This area,           3D conformation associated to the primary structure. The
known as Bioinformatics uses algorithms and method-                protein structure prediction problem is the problem of pre-
ological techniques developed by Computer Sciences to              dicting the tertiary structure, given the primary structure.
solve challenging problems in “Bio” areas. Moreover, new               The Tertiary Structure usually assumes two types of lo-
emerging problems produce stimuli for Computer Sciences            cal conformation: α-helices and β-sheets. In Figure 1 we
to develop new algorithms and methods. Bioinformatics              report the primary and the tertiary structure of the protein
deals with recognition, analysis, and organization of DNA          2K2P deposited in April 2008. In the top figure all atoms
sequences, with biological systems simulations, with prob-         of the amino acids are represented. In the lower figure we
lems of prediction of the spatial conformation of a biolog-        report the abstract structure obtained linking the Cα atoms
ical polymer, among others.                                        (briefly, a central atom of each aminoacid). With this ab-
   We have worked in this field in the last years with the         straction the secondary structure elements (three β-sheets
double effort of solving real problems and of spreading            and two α-helices) are evident.
known techniques, methods, and languages to “Bio” re-                  Let D be a set of admissible points for the amino acids.
searchers.                                                         Let c, d two fixed distances. For two points p, q ∈ D, we
   In this spirit, we have been organizers of the work-            say that next(p, q) if and only if |p − q| = d.1 For two
shops WCB (Constraint-Based Methods for Bioinformat-                  1 For real proteins, d = 3.8Å corresponding to the distance between

ics) associated with ICLP in 2005 and 2007, with CP                two consecutive Cα in the sequence

                                                               2
                                   Il Milione: A Journey in the Computational Logic in Italy


   A G L S F H V E D M T C G H C A G V I K G                       Three are the main contact energy models used in litera-
   A I E K T V P G A A V H A D P A S R T V V
   V G G V S D A A H I A E I I T A A G Y T P E                  ture for Pot: the HP model [19], the HPNX model [4], and
                                                                the 20x20 model [6].

                                                                3       Related Work
                                                                In the HP model [19], amino acids are split in two fam-
                                                                ilies: hydrophobic (H) and polar (P). Two hydrophobic
                                                                amino acids in contact contribute -1 to the energy. The
                                                                other contacts are not relevant. The NP-completeness even
                                                                in the simple spatial model constituted by the N2 lattice2 is
                                                                proved in [9]. In particular, it is proved that the problem:
                                                                Given a sequence of P and H, stating the existence of a
                                                                folding with at least k contacts between H is NP-complete.
                                                                   Backofen and Will solved this problem using constraint
                                                                programming for protein of length 160 and more on the
                                                                FCC (see [3, 1, 2]). Efficiency is obtained using a clever
                                                                symmetry breaking and the notion of core. Basically, the
                                                                folding is analyzed layer by layer and the various con-
                                                                formations of each layer that maximize contacts are pre-
                                                                computed. This kind of approach is unapplicable to a
                                                                more detailed energy models and with the adding of other
Figure 1: Primary and Tertiary structures (all-atoms and        structural constraints (e.g., known α-helices and β-sheets).
Cα –Cα structure) of Protein 2K2P (amino acids 22–85).          Slightly more complex energy models have been proposed
Observe the presence of 2 α-helices (in red—dark gray)          by the same group for the protein structure prediction prob-
and 3 β-sheets (in cyan—light gray)                             lem. In [4] they consider an energy model in which amino
                                                                acids are split into 4 families. Other researchers (e.g. [23])
                                                                instead approximated the solution to the same problem us-
points p, q ∈ D, we define the Boolean function contact
                                                                ing local search and refined meta-heuristics.
as follows: contact(p, q) = 1 if and only if |p − q| ≤ c.
                                                                   Barahona and Kripphal, instead, work on off-lattice
   A function ω : {1, . . . , n} −→ D is said a folding if
                                                                space model where space is discretized into small cubes.
  • for i, j ∈ {1, . . . , n} if i 6= j then ωi 6= ωj           They also deal with protein docking and develop the tool
                                                                Chemera, commonly used by biochemists in their re-
  • for i ∈ {1, . . . , n − 1} it holds that next(ωi , ωi+1 )   search [22, 5].
  Let Pot be a function from pairs of amino acids to inte-
ger numbers. The free energy of a folding E(ω) is com-          4       Our Contribution
puted as follows:
                   X                                            In all our works, we have used FCC as the space model,
   E(ω) =                   contact(ωi , ωj )Pot(si , sj )      and the 20x20 statistical potential contact energy model
                   1 ≤ i < n
                  i+2 ≤ j ≤ n
                                                                presented in [6].

   The protein structure prediction problem (PSP) is the        CLP(FD) encoding. In [20] we encoded the problem
problem of determining the folding(s) ω with minimum en-        using the library clpfd of SICStus Prolog. Since con-
ergy. The problem contains some symmetries that can be          tact energy is not suited to predict helices and sheets in
avoided by symmetry breaking search (see e.g. [2]). The         the FCC lattice, we pre-computed secondary structure ele-
simplest way to remove some symmetries is to fix the po-        ments (α-helices and β-strands) using other well-known
sitions of the first two points (ω1 and ω2 ).                   tools. The results of these pre-computations were then
   Two main approximations can be made: (1) space: the          used as constraints within the main code. In this first
set of admissible points, and (2) energy: the details of        encoding the number of admissible angles for secondary
the Potential function used. It is well-known that lattice-     structure elements was too limited. We relaxed this restrain
based models are realistic approximations of the set of         in [10] where a more general and precise handling of sec-
the admissible points for the Cα atoms of a protein [24].       ondary structure constraints was implemented. However,
Lattices are basically 3D graphs with repeated patterns.        the exponential growth of the search space w.r.t. protein
For instance the face centered cube (FCC) lattice is de-        length made impossible to explore the whole search space
                                   3
fined as: D = {(x, y, z) ∈ N      √ : x+y+√    z is even},
                    2
E = {(p, q) ∈ D : |p − q| = 2}. Thus, d = 2, c = 2.                 2 I.e., D = N2 , E = {(p, q) ∈ D 2   : |p − q| = 1}, c = d = 1.

                                                                    3
                            Il Milione: A Journey in the Computational Logic in Italy


even using state-of-the-art constraint solvers for proteins                ID–n      [20]     [10]     [11]    [13]    [16]
of length greater than 30/40. Therefore, we proposed an                  1LE3–16    12.5m      5s      2.5s    1.5s     0.5s
ad-hoc labeling search with biologically motivated heuris-               1ZDD–34     47m      41s     17.5s     2m      0.1s
tics and we introduced data structure (potential matrix) that            2GP8–40     6.5h      9m     10.5h    1.5m     0.5s
allowed us to reduce calculations during this phase. This                1ENH–54     3.5h     13m      24h     55m     49.5s
approach was then extended by relaxing some constraints
and developing other search heuristics [11].                         Figure 2: Running time of the various approaches on some
   In all these approaches we used a double representation           small proteins
for the tertiary structure: a cartesian one, based on the set
of points, and a polar one, based on the torsional angles            rightmost). The solutions found with various techniques
generated by the protein during the folding. The carte-              are not always the same, but (save for the first column re-
sian representation is useful for defining the notion of self-       lated to a too strict encoding) they have comparable energy
avoiding walk and the notion of constraint-based energy              and form. And, more important, the form is very close to
function. The polar representation simplifies the encoding           their real tertiary structure. The protein 2K2P of Figure 1
of secondary structure constraints. However, a lot of extra          is predicted by COLA 3.1 with BBF in less than one hour.
constraints need to be introduced to manage the conver-
sion between the two representations. This badly scales              Towards generalition and integration. The ab-initio
on large proteins (the constraint solvers used were close            approach used by COLA is still computationally infeasi-
to their memory limit for protein of length 60). Thus we             ble when applied to the prediction of protein structures
decided (in [13]) to abandon the polar representation and            with more than hundred amino acids. Only the presence of
to impose secondary structure constraints only using carte-          other kind of partial information (e.g., known folds for sub-
sian constraints. This way, we loose the chirality property          blocks picked from the protein data bank) can speed-up
of helices but the overall definition becomes simpler.               significantly the search. This is however in line with what
   In the same paper we also developed a search heuris-              done by other prediction tools (like e.g. ROSETTA), where
tics (Bounded Block Fail—BBF). The list of variables is              partial information is picked from the protein data-bank
dynamically split into blocks of k variables that will be la-        from similar structures/substructures and only small subse-
beled together. When the variables in the block Bi are in-           quences need to be arranged. Thus, we have started a sys-
stantiated to an apparently admissible solution, the search          tematic study of what kind of global constraints are needed
moves to the successive block Bi+1 , if any. If the labeling         in a solver for lattice models structure predictions. In par-
of the block Bi+1 fails, the search backtracks to the block          ticular we have studied the definition and the complex-
Bi . Now, there are two options: if the number of times              ity of testing satisfiability and applying propagation for
that Bi+1 has failed is below a certain threshold, then the          the constraints alldifferent, contiguous, self
process continues, by generating one more solution to Bi             avoiding walk, alldistant, chain, and rigid
and re-entering Bi+1 . Otherwise, the heuristics generates           block constraint in [14]; we have studied a global con-
a failure for Bi as well and backtracks to Bi−1 . The key            straint that accounts for partial information coming from
idea is that small local changes do not change too much              density maps in [15]. These global constraints will be in-
the form of a protein. When we tried a sufficient number             corporated in COLA so as to obtain a tool able to profit
of close conformations and we fail, we can freely abandon            as much as possible of partial information coming from
that research branch (with fail we consider either no admis-         known proteins and from partial predictions.
sible foldings or admissible foldings with energy greater               We have also studied how to use model checking re-
than the local minimum already found).                               sults for analyzing the folding process [18] and how to
                                                                     model the protein folding problem as a planning problem
Ad-hoc constraint solver. In [12] we developped an ad-               using a variant of the well-known action description lan-
hoc constraint solver written in C, named COLA (COn-                 guage B [21]. An alternative approach to the protein fold-
straint solving on LAttices). In the previous approaches             ing problem based on Agent-Based simulation is proposed
each 3D point was viewed as a triple of FD variables                 in [7].
hX, Y, Zi. In COLA, instead, the lattice point is an el-
ementary element, associated with a 3D domain (a box).               5    Conclusions and future work
We developed and implemented ad-hoc constraint propa-
gation techniques and the BBF heuristics. This approach              This work represents a typical use of logic programming
with a further parallelization was then presented in [16].           paradigm for problem solving. The problem can be en-
   Just to give a taste of the evolution of our proposals,           coded easily and solutions (for small inputs) can be com-
we report the running times of the systems on the predic-            puted by built-in mechanisms of (constraint) logic pro-
tion of some small proteins in Figure 2. Timings are taken           gramming. Heuristics and alternative encodings can be
from the published papers (the machine used for the left-            easily programmed and tested. When the encoding be-
most column is roughly 3x slower than the machine for the            comes stable, speed-up can be obtained by less declara-

                                                                 4
                                Il Milione: A Journey in the Computational Logic in Italy


tive methods. The results obtained are promising for the       [11] A. Dal Palù, A. Dovier, and F. Fogolari. Constraint
success of the application of the same approach to other            logic programming approach to protein structure
challenging problems of Bioinformatics.                             prediction. BMC Bioinformatics 5(186), 2004.
                                                               [12] A. Dal Palù, A. Dovier, and E. Pontelli. A Constraint
Acknowledgements. The research summarized in this                   Logic Programming Approach to 3D Structure De-
paper would not have been possible without the help of              termination of Large Protein Complexes. Proc. of
my valuable colleagues and real friends Alessandro Dal              LPAR05, pp. 48–63, 2005.
Palù, Federico Fogolari, and Enrico Pontelli. I would also
like to thank Luca Bortolussi, Elisabetta De Maria, Andrea     [13] A. Dal Palù, A. Dovier, and E. Pontelli. Heuristics,
Formisano, Angelo Montanari, and Carla Piazza for their             Optimizations, and Parallelism for Protein Structure
collaboration. The research has been partially supported            Prediction in CLP(FD). Proc. of PPDP05, pp. 230–
by the FIRB RBNE03B8KK, and by PRIN and GNCS                        241, ACM, Lisbon 2005.
projects.
                                                               [14] A. Dal Palù, A. Dovier, and E. Pontelli. Global
                                                                    constraints for Discrete Lattices. Proc. of WCB06,
REFERENCES                                                          Nantes, pp. 55–68, 2006.
  [1] R. Backofen. The protein structure prediction prob-      [15] A. Dal Palù, A. Dovier, and E. Pontelli. The den-
      lem: A constraint optimization approach using a               sity constraint. Proc. of WCB07, Porto, pp. 10–19,
      new lower bound. Constraints 6:223–255, 2001.                 2007.
  [2] R. Backofen and S. Will. Excluding Symmetries in         [16] A. Dal Palù, A. Dovier, and E. Pontelli. A constraint
      Constraint-Based Search. Constraints 7(3–4):333–              solver for discrete lattices, its parallelization, and
      349, 2002.                                                    application to protein structure prediction. Software
  [3] R. Backofen and S. Will. A Constraint-Based Ap-               Practice and Experience, DOI: 10.1002/spe.
      proach to Fast and Exact Structure Prediction in 3-           810.
      Dimensional Protein Models. Constraints 11(1):5–         [17] A. Dal Palù, A. Dovier, and S. Will (eds.) Special
      30, 2006.                                                     issue on Constraint Based Methods for Bioinformat-
  [4] R. Backofen, S. Will, and E. Bornberg-Bauer. Ap-              ics. Constraints 13(1–2), 2008.
      plication of constraint programming techniques for       [18] E. De Maria, A. Dovier, A. Montanari, and C. Pi-
      structure prediction of lattice proteins with extended        azza. Exploiting Model Checking in Constraint-
      alphabets. Bioinformatics 15(3): 234-242, 1999.               based Approaches to the Protein Folding. Proc. of
  [5] P. Barahona and L. Krippahl. Constraint Program-              WCB06, pp.46-54, Nantes, 2006.
      ming in Structural Bioinformatics. [17]:3–20.            [19] K. A. Dill. Dominant forces in protein folding. Bio-
  [6] M. Berrera, H. Molinari, and F. Fogolari. Amino               chemistry 29:7133-7155, 1990.
      acid empirical contact energy definitions for fold       [20] A. Dovier, M. Burato, and F. Fogolari. Using Sec-
      recognition in the space of contact maps. BMC                 ondary Structure Information for Protein Folding in
      Bioinformatics 4(8), 2003.                                    CLP(FD). Proc. of WFLP02, ENTCS 76, 2002.
  [7] L. Bortolussi, A. Dovier, and F. Fogolari. Agent-        [21] A. Dovier, A. Formisano, and E. Pontelli. Mul-
      based Protein Structure Prediction. Multiagent and            tivalued Action Languages with Constraints in
      Grid Systems 3(2):183–197, 2007.                              CLP(FD). Proc. of ICLP07, LNCS 4670, pp. 255–
  [8] R. Cipriano, A. Palù, and A. Dovier. A hybrid                270, 2007.
      approach mixing local search and constraint pro-         [22] L. Krippahl and P. Barahona. PSICO: Solving Pro-
      gramming applied to the protein structure prediction          tein Structures with Constraint Programming and
      problem. Proc. of WCB08, Paris, 2008.                         Optimisation. Constraints, 7:317–331, 2002.
  [9] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Pic-      [23] A. Shmygelska and H. H. Hoos. An ant colony opti-
      colboni, and M. Yannakakis. On the Complexity of              misation algorithm for the 2D and 3D hydrophobic
      Protein Folding, Journal of Computational Biology,            polar protein folding problem. BMC Bioinformatics
      5(3):423–466, 1998.                                           6(30), 2005.
 [10] A. Dal Palù, A. Dovier, and F. Fogolari. Protein        [24] J. Skolnick and A. Kolinski. Reduced models of
      Folding in CLP(FD) with Empirical Contact En-                 proteins and their applications. Polymer, 45:511–
      ergies. Proc. of CSCLP03:250–265, LNCS 3010,                  524, 2004.
      2004.

                                                                5
                          Il Milione: A Journey in the Computational Logic in Italy


6   Contacts
Agostino Dovier
Dip. di Matematica e Informatica
Univ. di Udine
Via delle Scienze 206, 33100 Udine (UD)
Tel: +39 0432 558494
E-mail: dovier@dimi.uniud.it

7   Biography
Agostino Dovier received his PhD in Computer Science from the University of Pisa in 1996 and he is an Associate
Professor of Computer Science at the University of Udine. His current research interests include the development and the
application of declarative programming languages with constraints and Bioinformatics. He is member of AI*IA and of
the EC of GULP and ALP and he has published over 60 international referred publications. He served as program chair
or in the program committee of several conferences and workshops of logic and constraint programming, as guest editor
of special issues of international journals, and he has coordinated some research projects in the area of Constraint Logic
Programming and Bioinformatics. He is the general chair of ICLP08 (International Conference on Logic Programming).


                                                            6