Il Milione: A Journey in the Computational Logic in Italy Recenti progressi basati su Programmazione Logica e/o a Vincoli sulla soluzione del protein folding Recent Constraint/Logic Programming based advances in the solution of the Protein Folding Problem Agostino Dovier 1 Il Milione: A Journey in the Computational Logic in Italy SOMMARIO/ABSTRACT in 2006, and with CPAIOR in 2008 (see, e.g., http: //wcb08.dimi.uniud.it); we have organized the In questo articolo desidero illustrare il contributo del mio International Summer Schools BCI (Biology, Communi- gruppo di ricerca alla disciplina Bioinformatica, con par- cation, and Information) in Dobbiaco and Trieste (see, ticolare riferimento alla risoluzione del problema della e.g., http://bioinf.dimi.uniud.it/bci2006; predizione di struttura di una proteina usando metodologie and we have been guest editors of a special issue of the di programmazione logica e a vincoli. journal Constraints on these topics [17]. In this paper, we summarize the contribution to Bioinfor- As far as the technical contribution is concerned, we matics of our research group. In particular, we will present have worked on the Protein Structure Prediction problem our approach to the solution of the protein structure pre- using, whenever possible, techniques coming from logic diction problem based on constraint/logic programming programming and constraint programming. In the rest of techniques. this paper we briefly introduce this challenging problem and give an overview of our results. Keywords: Logic Programming, Constraint Program- ming, Bioinformatics 2 The Protein Structure Prediction Problem 1 Introduction The Primary structure of a protein is a linked sequence of aminoacids. There are 20 kinds of aminoacids, iden- In the last years we have witnessed the birth and the rapid tified by a letter. For the scope of this paper, the pri- growth of a new research area whose results have a posi- mary structure of a protein is a string s1 · · · sn with si ∈ tive impact on traditional and fundamental disciplines such {A, . . . , Z} \ {B, J, O, U, X, Z}. as biology, chemistry, physics, medicine, agriculture, or The Tertiary Structure (native state) of the protein is a industry (briefly denoted globally as “Bio”). This area, 3D conformation associated to the primary structure. The known as Bioinformatics uses algorithms and method- protein structure prediction problem is the problem of pre- ological techniques developed by Computer Sciences to dicting the tertiary structure, given the primary structure. solve challenging problems in “Bio” areas. Moreover, new The Tertiary Structure usually assumes two types of lo- emerging problems produce stimuli for Computer Sciences cal conformation: α-helices and β-sheets. In Figure 1 we to develop new algorithms and methods. Bioinformatics report the primary and the tertiary structure of the protein deals with recognition, analysis, and organization of DNA 2K2P deposited in April 2008. In the top figure all atoms sequences, with biological systems simulations, with prob- of the amino acids are represented. In the lower figure we lems of prediction of the spatial conformation of a biolog- report the abstract structure obtained linking the Cα atoms ical polymer, among others. (briefly, a central atom of each aminoacid). With this ab- We have worked in this field in the last years with the straction the secondary structure elements (three β-sheets double effort of solving real problems and of spreading and two α-helices) are evident. known techniques, methods, and languages to “Bio” re- Let D be a set of admissible points for the amino acids. searchers. Let c, d two fixed distances. For two points p, q ∈ D, we In this spirit, we have been organizers of the work- say that next(p, q) if and only if |p − q| = d.1 For two shops WCB (Constraint-Based Methods for Bioinformat- 1 For real proteins, d = 3.8Å corresponding to the distance between ics) associated with ICLP in 2005 and 2007, with CP two consecutive Cα in the sequence 2 Il Milione: A Journey in the Computational Logic in Italy A G L S F H V E D M T C G H C A G V I K G Three are the main contact energy models used in litera- A I E K T V P G A A V H A D P A S R T V V V G G V S D A A H I A E I I T A A G Y T P E ture for Pot: the HP model [19], the HPNX model [4], and the 20x20 model [6]. 3 Related Work In the HP model [19], amino acids are split in two fam- ilies: hydrophobic (H) and polar (P). Two hydrophobic amino acids in contact contribute -1 to the energy. The other contacts are not relevant. The NP-completeness even in the simple spatial model constituted by the N2 lattice2 is proved in [9]. In particular, it is proved that the problem: Given a sequence of P and H, stating the existence of a folding with at least k contacts between H is NP-complete. Backofen and Will solved this problem using constraint programming for protein of length 160 and more on the FCC (see [3, 1, 2]). Efficiency is obtained using a clever symmetry breaking and the notion of core. Basically, the folding is analyzed layer by layer and the various con- formations of each layer that maximize contacts are pre- computed. This kind of approach is unapplicable to a more detailed energy models and with the adding of other Figure 1: Primary and Tertiary structures (all-atoms and structural constraints (e.g., known α-helices and β-sheets). Cα –Cα structure) of Protein 2K2P (amino acids 22–85). Slightly more complex energy models have been proposed Observe the presence of 2 α-helices (in red—dark gray) by the same group for the protein structure prediction prob- and 3 β-sheets (in cyan—light gray) lem. In [4] they consider an energy model in which amino acids are split into 4 families. Other researchers (e.g. [23]) instead approximated the solution to the same problem us- points p, q ∈ D, we define the Boolean function contact ing local search and refined meta-heuristics. as follows: contact(p, q) = 1 if and only if |p − q| ≤ c. Barahona and Kripphal, instead, work on off-lattice A function ω : {1, . . . , n} −→ D is said a folding if space model where space is discretized into small cubes. • for i, j ∈ {1, . . . , n} if i 6= j then ωi 6= ωj They also deal with protein docking and develop the tool Chemera, commonly used by biochemists in their re- • for i ∈ {1, . . . , n − 1} it holds that next(ωi , ωi+1 ) search [22, 5]. Let Pot be a function from pairs of amino acids to inte- ger numbers. The free energy of a folding E(ω) is com- 4 Our Contribution puted as follows: X In all our works, we have used FCC as the space model, E(ω) = contact(ωi , ωj )Pot(si , sj ) and the 20x20 statistical potential contact energy model 1 ≤ i < n i+2 ≤ j ≤ n presented in [6]. The protein structure prediction problem (PSP) is the CLP(FD) encoding. In [20] we encoded the problem problem of determining the folding(s) ω with minimum en- using the library clpfd of SICStus Prolog. Since con- ergy. The problem contains some symmetries that can be tact energy is not suited to predict helices and sheets in avoided by symmetry breaking search (see e.g. [2]). The the FCC lattice, we pre-computed secondary structure ele- simplest way to remove some symmetries is to fix the po- ments (α-helices and β-strands) using other well-known sitions of the first two points (ω1 and ω2 ). tools. The results of these pre-computations were then Two main approximations can be made: (1) space: the used as constraints within the main code. In this first set of admissible points, and (2) energy: the details of encoding the number of admissible angles for secondary the Potential function used. It is well-known that lattice- structure elements was too limited. We relaxed this restrain based models are realistic approximations of the set of in [10] where a more general and precise handling of sec- the admissible points for the Cα atoms of a protein [24]. ondary structure constraints was implemented. However, Lattices are basically 3D graphs with repeated patterns. the exponential growth of the search space w.r.t. protein For instance the face centered cube (FCC) lattice is de- length made impossible to explore the whole search space 3 fined as: D = {(x, y, z) ∈ N √ : x+y+√ z is even}, 2 E = {(p, q) ∈ D : |p − q| = 2}. Thus, d = 2, c = 2. 2 I.e., D = N2 , E = {(p, q) ∈ D 2 : |p − q| = 1}, c = d = 1. 3 Il Milione: A Journey in the Computational Logic in Italy even using state-of-the-art constraint solvers for proteins ID–n [20] [10] [11] [13] [16] of length greater than 30/40. Therefore, we proposed an 1LE3–16 12.5m 5s 2.5s 1.5s 0.5s ad-hoc labeling search with biologically motivated heuris- 1ZDD–34 47m 41s 17.5s 2m 0.1s tics and we introduced data structure (potential matrix) that 2GP8–40 6.5h 9m 10.5h 1.5m 0.5s allowed us to reduce calculations during this phase. This 1ENH–54 3.5h 13m 24h 55m 49.5s approach was then extended by relaxing some constraints and developing other search heuristics [11]. Figure 2: Running time of the various approaches on some In all these approaches we used a double representation small proteins for the tertiary structure: a cartesian one, based on the set of points, and a polar one, based on the torsional angles rightmost). The solutions found with various techniques generated by the protein during the folding. The carte- are not always the same, but (save for the first column re- sian representation is useful for defining the notion of self- lated to a too strict encoding) they have comparable energy avoiding walk and the notion of constraint-based energy and form. And, more important, the form is very close to function. The polar representation simplifies the encoding their real tertiary structure. The protein 2K2P of Figure 1 of secondary structure constraints. However, a lot of extra is predicted by COLA 3.1 with BBF in less than one hour. constraints need to be introduced to manage the conver- sion between the two representations. This badly scales Towards generalition and integration. The ab-initio on large proteins (the constraint solvers used were close approach used by COLA is still computationally infeasi- to their memory limit for protein of length 60). Thus we ble when applied to the prediction of protein structures decided (in [13]) to abandon the polar representation and with more than hundred amino acids. Only the presence of to impose secondary structure constraints only using carte- other kind of partial information (e.g., known folds for sub- sian constraints. This way, we loose the chirality property blocks picked from the protein data bank) can speed-up of helices but the overall definition becomes simpler. significantly the search. This is however in line with what In the same paper we also developed a search heuris- done by other prediction tools (like e.g. ROSETTA), where tics (Bounded Block Fail—BBF). The list of variables is partial information is picked from the protein data-bank dynamically split into blocks of k variables that will be la- from similar structures/substructures and only small subse- beled together. When the variables in the block Bi are in- quences need to be arranged. Thus, we have started a sys- stantiated to an apparently admissible solution, the search tematic study of what kind of global constraints are needed moves to the successive block Bi+1 , if any. If the labeling in a solver for lattice models structure predictions. In par- of the block Bi+1 fails, the search backtracks to the block ticular we have studied the definition and the complex- Bi . Now, there are two options: if the number of times ity of testing satisfiability and applying propagation for that Bi+1 has failed is below a certain threshold, then the the constraints alldifferent, contiguous, self process continues, by generating one more solution to Bi avoiding walk, alldistant, chain, and rigid and re-entering Bi+1 . Otherwise, the heuristics generates block constraint in [14]; we have studied a global con- a failure for Bi as well and backtracks to Bi−1 . The key straint that accounts for partial information coming from idea is that small local changes do not change too much density maps in [15]. These global constraints will be in- the form of a protein. When we tried a sufficient number corporated in COLA so as to obtain a tool able to profit of close conformations and we fail, we can freely abandon as much as possible of partial information coming from that research branch (with fail we consider either no admis- known proteins and from partial predictions. sible foldings or admissible foldings with energy greater We have also studied how to use model checking re- than the local minimum already found). sults for analyzing the folding process [18] and how to model the protein folding problem as a planning problem Ad-hoc constraint solver. In [12] we developped an ad- using a variant of the well-known action description lan- hoc constraint solver written in C, named COLA (COn- guage B [21]. An alternative approach to the protein fold- straint solving on LAttices). In the previous approaches ing problem based on Agent-Based simulation is proposed each 3D point was viewed as a triple of FD variables in [7]. hX, Y, Zi. In COLA, instead, the lattice point is an el- ementary element, associated with a 3D domain (a box). 5 Conclusions and future work We developed and implemented ad-hoc constraint propa- gation techniques and the BBF heuristics. This approach This work represents a typical use of logic programming with a further parallelization was then presented in [16]. paradigm for problem solving. The problem can be en- Just to give a taste of the evolution of our proposals, coded easily and solutions (for small inputs) can be com- we report the running times of the systems on the predic- puted by built-in mechanisms of (constraint) logic pro- tion of some small proteins in Figure 2. Timings are taken gramming. Heuristics and alternative encodings can be from the published papers (the machine used for the left- easily programmed and tested. When the encoding be- most column is roughly 3x slower than the machine for the comes stable, speed-up can be obtained by less declara- 4 Il Milione: A Journey in the Computational Logic in Italy tive methods. The results obtained are promising for the [11] A. Dal Palù, A. Dovier, and F. Fogolari. Constraint success of the application of the same approach to other logic programming approach to protein structure challenging problems of Bioinformatics. prediction. BMC Bioinformatics 5(186), 2004. [12] A. Dal Palù, A. Dovier, and E. Pontelli. A Constraint Acknowledgements. The research summarized in this Logic Programming Approach to 3D Structure De- paper would not have been possible without the help of termination of Large Protein Complexes. Proc. of my valuable colleagues and real friends Alessandro Dal LPAR05, pp. 48–63, 2005. Palù, Federico Fogolari, and Enrico Pontelli. I would also like to thank Luca Bortolussi, Elisabetta De Maria, Andrea [13] A. Dal Palù, A. Dovier, and E. Pontelli. Heuristics, Formisano, Angelo Montanari, and Carla Piazza for their Optimizations, and Parallelism for Protein Structure collaboration. The research has been partially supported Prediction in CLP(FD). Proc. of PPDP05, pp. 230– by the FIRB RBNE03B8KK, and by PRIN and GNCS 241, ACM, Lisbon 2005. projects. [14] A. Dal Palù, A. Dovier, and E. Pontelli. Global constraints for Discrete Lattices. Proc. of WCB06, REFERENCES Nantes, pp. 55–68, 2006. [1] R. Backofen. The protein structure prediction prob- [15] A. Dal Palù, A. Dovier, and E. Pontelli. The den- lem: A constraint optimization approach using a sity constraint. Proc. of WCB07, Porto, pp. 10–19, new lower bound. Constraints 6:223–255, 2001. 2007. [2] R. Backofen and S. Will. Excluding Symmetries in [16] A. Dal Palù, A. Dovier, and E. Pontelli. A constraint Constraint-Based Search. Constraints 7(3–4):333– solver for discrete lattices, its parallelization, and 349, 2002. application to protein structure prediction. Software [3] R. Backofen and S. Will. A Constraint-Based Ap- Practice and Experience, DOI: 10.1002/spe. proach to Fast and Exact Structure Prediction in 3- 810. Dimensional Protein Models. Constraints 11(1):5– [17] A. Dal Palù, A. Dovier, and S. Will (eds.) Special 30, 2006. issue on Constraint Based Methods for Bioinformat- [4] R. Backofen, S. Will, and E. Bornberg-Bauer. Ap- ics. Constraints 13(1–2), 2008. plication of constraint programming techniques for [18] E. De Maria, A. Dovier, A. Montanari, and C. Pi- structure prediction of lattice proteins with extended azza. Exploiting Model Checking in Constraint- alphabets. Bioinformatics 15(3): 234-242, 1999. based Approaches to the Protein Folding. Proc. of [5] P. Barahona and L. Krippahl. Constraint Program- WCB06, pp.46-54, Nantes, 2006. ming in Structural Bioinformatics. [17]:3–20. [19] K. A. Dill. Dominant forces in protein folding. Bio- [6] M. Berrera, H. Molinari, and F. Fogolari. Amino chemistry 29:7133-7155, 1990. acid empirical contact energy definitions for fold [20] A. Dovier, M. Burato, and F. Fogolari. Using Sec- recognition in the space of contact maps. BMC ondary Structure Information for Protein Folding in Bioinformatics 4(8), 2003. CLP(FD). Proc. of WFLP02, ENTCS 76, 2002. [7] L. Bortolussi, A. Dovier, and F. Fogolari. Agent- [21] A. Dovier, A. Formisano, and E. Pontelli. Mul- based Protein Structure Prediction. Multiagent and tivalued Action Languages with Constraints in Grid Systems 3(2):183–197, 2007. CLP(FD). Proc. of ICLP07, LNCS 4670, pp. 255– [8] R. Cipriano, A. Palù, and A. Dovier. A hybrid 270, 2007. approach mixing local search and constraint pro- [22] L. Krippahl and P. Barahona. PSICO: Solving Pro- gramming applied to the protein structure prediction tein Structures with Constraint Programming and problem. Proc. of WCB08, Paris, 2008. Optimisation. Constraints, 7:317–331, 2002. [9] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Pic- [23] A. Shmygelska and H. H. Hoos. An ant colony opti- colboni, and M. Yannakakis. On the Complexity of misation algorithm for the 2D and 3D hydrophobic Protein Folding, Journal of Computational Biology, polar protein folding problem. BMC Bioinformatics 5(3):423–466, 1998. 6(30), 2005. [10] A. Dal Palù, A. Dovier, and F. Fogolari. Protein [24] J. Skolnick and A. Kolinski. Reduced models of Folding in CLP(FD) with Empirical Contact En- proteins and their applications. Polymer, 45:511– ergies. Proc. of CSCLP03:250–265, LNCS 3010, 524, 2004. 2004. 5 Il Milione: A Journey in the Computational Logic in Italy 6 Contacts Agostino Dovier Dip. di Matematica e Informatica Univ. di Udine Via delle Scienze 206, 33100 Udine (UD) Tel: +39 0432 558494 E-mail: dovier@dimi.uniud.it 7 Biography Agostino Dovier received his PhD in Computer Science from the University of Pisa in 1996 and he is an Associate Professor of Computer Science at the University of Udine. His current research interests include the development and the application of declarative programming languages with constraints and Bioinformatics. He is member of AI*IA and of the EC of GULP and ALP and he has published over 60 international referred publications. He served as program chair or in the program committee of several conferences and workshops of logic and constraint programming, as guest editor of special issues of international journals, and he has coordinated some research projects in the area of Constraint Logic Programming and Bioinformatics. He is the general chair of ICLP08 (International Conference on Logic Programming). 6