<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLP-based protein fragment assembly</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A. Dal Palu`</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Dovier</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>F. Fogolari</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E. Pontelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New Mexico State Univ., Dept. Maths and Computer Science</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Parma</institution>
          ,
          <addr-line>Dip. di Matematica</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Univ. Udine, Dip. di Matematica e Informatica</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Univ. Udine, Dip. di Scienze e Tecnologie Biomediche</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper investigates a novel approach, based on Constraint Logic Programming (CLP), to predict the 3D conformation of a protein via fragments assembly. The fragments are extracted by a preprocessor-also developed for this work- from a database of known protein structures that clusters and classifies the fragments according to similarity and frequency. The problem of assembling fragments into a complete conformation is mapped to a constraint solving problem and solved using CLP. The constraint-based model uses a medium discretization degree Cα-side chain centroid protein model that offers efficiency and a good approximation for space filling. The approach adapts existing energy models to the protein representation used and applies a large neighboring search strategy. The results show the feasibility and efficiency of the method. The declarative nature of the solution allows to include future extensions, e.g. different size fragments for better accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Proteins are central components in the way they control and execute the vital
functions in living organisms. The functions of a protein are directly related to
its peculiar three-dimensional conformation. Knowledge of the three-dimensional
conformation of a protein (also known as the native conformation or tertiary
structure) is essential to biomedical investigation. The native conformation
represents the functional protein and determines how it can interact with other
proteins and affect the functions of the hosting organism. It is impossible to
clearly understand the behavior and phenotype of an organism without
knowledge of the native conformation of the proteins coded in its genome. As a result
of advancements in DNA sequencing techniques, there is a large and growing
number of protein sequences—i.e., lists of amino acids, also known as primary
structures of proteins—available in public databases (e.g., the database
Swissprot contains about 500, 000 protein sequences). On the other hand, knowledge
of structural information (e.g., information concerning the tertiary structures)
is lagging behind, with a much smaller number of structures deposited in public
databases, notwithstanding structural genomics initiatives started worldwide.</p>
      <p>For these reasons, one of the most traditional and central problems addressed
by research in bioinformatics deals with the protein structure prediction problem,
i.e., the problem of using computational methods to determine the native
conformation of a protein starting from its primary sequence. Several approaches have
been explored to address this problem. In broad terms, it is possible to
distinguish between two main classes of approaches. The more traditional term protein
structure prediction has been commonly used to describe methods that rely on
comparisons between known and unknown proteins to predict the end-result of
the spontaneous protein folding process (also known as homology modeling). The
more expensive task of predicting a protein structure starting from the
knowledge of the chemical structure and laws of physics (known as de-novo/ab-initio
prediction) has been studied with different levels of accuracy and complexity.
Instead, protein folding simulations tries to understand the folding path leading
to the native conformation, typically using investigations of the potential energy
landscape or using molecular dynamics simulations. In both classes of methods,
knowledge of “patterns” can be used to restrict the search space—and this is
particularly true for the case of secondary structure components of a protein,
i.e., local helices or strands. Secondary structure components are important,
considering that their formation is believed to represent the earliest phase of the
folding process, and their identification can be relatively simpler (e.g., through
low-resolution observations of images from electron microscopy).
1.1</p>
      <p>
        Related work
As mentioned above, it is possible to predict protein structures based on their
sequences, using either homology modeling or fold recognition techniques.
Nevertheless, it is in general difficult to predict a protein structure based only on
its sequence and in absence of structural templates. Explicit solvent molecular
dynamics simulations of protein folding are still beyond current computational
capabilities. Already in 1968, Levinthal postulated that the systematic
exploration of the space of possible conformations is infeasible [12]. This complexity
has been confirmed by theoretical results, showing that even extremely
simplified formalizations of the problem lead to computationally intractable
problems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Recently, ab-initio methods for generating protein structures given their
sequences have been proposed and successfully used [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Key elements of these
methods are the use of evolutionary information from multiple alignments, the
adoption of simplified representations of proteins, and the introduction of
fragments assembly techniques. These methods rely on assembling the structure
using simplified representations of protein fragments with favorable conformations
(obtained from structural databases) for the profile of the given sequence. Three
to nineteen-residue fragments (i.e., 3–9 for small proteins and 5–19 for large
ones) contain correlations among neighboring residue conformations, and most
of the computation time is spent in finding the global arrangement of fragments
that already display good local conformations [14, 16].
      </p>
      <p>
        Several simplified models have been introduced to address the problem.
Simplified models abstract several properties of both proteins and space, leading
to a version of the problem that can be solved more efficiently. The solutions
of the simplified problem constitute candidate configurations that can be
refined with more computationally intensive methods, e.g., molecular dynamics
simulations. Possible simplifications include the introduction of fixed sizes of
monomers and bond lengths, the representation of monomers as simple points,
and viewing the three-dimensional space as a discretized collection of points.
In these simplified models, it is possible to view the protein folding problem as
an optimization problem, aimed at determining conformations that minimize an
energy function. The energy function must be defined according to the simplified
model adopted [
        <xref ref-type="bibr" rid="ref4">4, 11</xref>
        ]. Simplified energy models have been devised specifically to
solve large instances of the structure prediction problem using constraint solving
approaches [
        <xref ref-type="bibr" rid="ref1">1, 9</xref>
        ].
      </p>
      <p>
        In our own previous efforts, we have applied declarative programming
approaches, based on constraint solving, to address the problem. We built our
efforts on a discrete crystal lattice organization of the allowable points in the
three-dimensional space. This representation exploits the property that the
distance between the Cα atoms5 of two consecutive amino acids is relatively
constant (3.8A˚). The problem is viewed as placing amino acids in the allowable
points, in such a way that constraints encoding the mutual distances of amino
acids in the primary sequence are satisfied [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The original framework has also
been expanded to support several types of global constraints, i.e., constraints
describing complex relationships among groups of amino acids. One of these
constraints is the rigid structure constraint—this constraint enables the
representation of known substructures (e.g., secondary structure components), reducing
the problem to the determination of an appropriate placement and rotation of
such substructures in the lattice space. The ability to use rigid structure
constraints has been shown to lead to dramatic reductions in the search space [
        <xref ref-type="bibr" rid="ref2">8, 2</xref>
        ].
However, exploiting knowledge of real rigid substructures in a discrete lattice is
infeasible, due to the errors introduced by the approximations imposed by the
discretized representation of the search space. On the other hand, the use of a
less constrained space model leads to search spaces that are intractable. These
two considerations lead to the new approach presented in this work.
1.2
      </p>
      <p>The contribution of this work
Some of the most successful approaches to protein folding build on the
principles of using substructures. The intuition is that, while the complete folding of
a protein may be unknown, it is likely that all possible substructures, if
properly chosen, can be found among proteins whose conformations are known. The
folding can be constructed by exploiting relationships among substructures. A
notable example of this approach is represented by Rosetta [14]—an ab
initio protein structure prediction method that uses simulated annealing search to
compose a conformation, by assembling substructures extracted from a fragment
library; the library is obtained from observed structures stored in the Protein
Data Bank (PDB, www.pdb.org).
5 A carbon atom that is a convenient representative of the whole amino acid</p>
      <p>In this work, we follow a similar idea, by developing a database of amino
acid chains of length 4; these are clustered according to similarity, and their
frequencies are drawn from the investigation of a relevant section of the Protein
Data Bank. The database contains the data needed to solve the protein folding
problem via fragments assembly. Declarative programming techniques are used
to generate clean and compact code, and to enable rapid prototyping. Moreover,
the problem of assembling substructures is efficiently tackled using the constraint
solving techniques provided by CLP (F D) systems.</p>
      <p>This paper has the goal of showing that our approach is feasible. The main
advantage, w.r.t. a highly engineered and imperative tool, is the modularity of
the constraint system, which offers a convenient framework to test and integrate
statistical data from various predictors and databases. Moreover, the constrained
search technique itself represents a novel method, compared to popular
predictors, and we show its effectiveness in combination with the development of new
energy functions and heuristics. The proposed solution includes a general
implementation of large neighboring search in CLP (F D), that turned out to be
highly effective for the problem at hand. Another contribution is the
development of a new energy function based on two components: a contact potential
for backbone and side chain centroids interaction, and an energy component for
backbone conformational preferences. Backbone and side chain steric hindrances
are imposed as constraints.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Protein Abstraction</title>
      <p>Preliminary notions
We focus on proteins described as sequences of amino acids selected from a set
A of 20 elements (those coded by the human genome). In turn, each amino acid
is composed of a set of atoms that constitute the amino acid’s backbone (see
Fig. 1) and a set of atoms that differentiate pairs of amino acids, known as side
chain. One of the most important structural properties is that two consecutive
Cα atoms have an average distance of 3.8A˚. Side chains may contain from 1 to
18 atoms, depending on the amino acid. For computational purposes, instead
of considering all atoms composing the protein, we consider a simplified model
in which we are interested in the position of the Cα atoms (representing the
backbone of the protein) and of particular points, known as the centroids of the
side chains (Fig. 3). A natural choice for the centroid is the center of mass of
the side chain.</p>
      <p>It is important to mention that, once the positions of all the Cα atoms and
of all the centroids are known, the structure of the protein is already sufficiently
determined, i.e., the position of the remaining atoms can be identified almost
deterministically with a reasonable accuracy.</p>
      <p>Focusing on the backbone and on the Cα atoms, three consecutive amino
acids define one bend angle (see Fig. 2—left). Consider now four consecutive
amino acids a1, a2, a3, a4. The angle of the vector connecting a3 to a4 w.r.t. the
ALA 7→ 0 LEU 7→ 1
MET 7→ 1 ARG 7→ 2
GLU 7→ 2 GLN 7→ 2
LYS 7→ 2 ASN 7→ 3
ASP 7→ 3 SER 7→ 3
THR 7→ 4 PHE 7→ 4
HIS 7→ 4 TYR 7→ 4
ILE 7→ 5 VAL 7→ 5
TRP 7→ 5 CYS 7→ 6
GLY 7→ 7 PRO 7→ 8
'
XXXz N
|scihdaein
CαXX C′
H</p>
      <p>'
$H
: NXX</p>
      <p>H
Cα</p>
      <p>$
O</p>
      <p>CX′XXz
|scihdaein</p>
      <p>H O &amp; %
&amp; %
Fig. 1. Two consecutive amino acids and the clustering of amino acids into 9 classes
plane formed by a1, a2, a3 is called torsional angle (see Fig. 2—right). If these
angles are known for all the consecutive 4-tuples forming a protein, they uniquely
describe the 3D positions of all the Cα atoms of the protein.</p>
      <p>Given a spatial conformation of a 4-tuple of consecutive Cα atoms, a small
degree of freedom for the position of the side chain is allowed—leading to
conformers commonly referred to as rotamers. To reduce the search space, we do
not consider such variations. Once the positions of the Cα atoms are known, we
deterministically add the positions of the centroids. In particular, the centroid of
the i-th residue is constructed by using the positions of Cαi−1, Cαi and Cαi+1
as reference and by considering the average of the center of mass of the same
amino acid type centroids, sampled from a non-redundant subset of the PDB.
The parameters that uniquely determine its position are: the average Cα-side
chain center of mass distance, the average bend angle formed by the side chain
center of mass-Cαi-Cαi+1, and the torsional angle formed by the
Cαi−1-CαiCαi+1-side chain center of mass. Even with this simplification, the introduction
of the centroids in the model allows us to better cope with the layout in the 3D
space and to use a richer energy model (Fig. 3). This has been experimentally
shown to produce more accurate results, without adding extra complexity w.r.t.
a model that considers only the positions of the Cα atoms and without the use
of centroids.
Although more than 60,000 protein structures are present in the PDB, the
complete set of known proteins contains too much redundancy (i.e., very similar
proteins deposited in several variants) to be useful for statistical purposes.</p>
      <p>To reduce the impact of redundancy, in our approach, we focused on a subset
of the PDB with limited redundancies—in particular, we used the so-called
top500 set [13].6 This set contains 500 proteins, with 107, 138 occurrences of amino
acids. The number of different 4-tuples occurring in the set is precisely 62, 831.
Since the number of possible 4-tuples of amino acids is |A4| = 204 = 160, 000,
this means that most 4-tuples do not appear in the selected set; even those
that appear, occur too rarely to provide significant statistical information. For
this reason, we decided to cluster amino acids into 9 classes, according to the
similarity of the torsional angles of the pseudo bond between two consecutive
Cα atoms [11]. Let γ : A −→ {0, . . . , 8} be the function assigning a class to
each amino acid, as defined in Fig. 1; for i ∈ {0, . . . , 8}, let us denote with
γ−1(i) = {a ∈ A : γ(a) = i}. In this way, the majority of the 94 = 6, 561
4-tuples have a representative in the set (precisely, there are templates for 5, 830
of them).</p>
      <p>A second level of approximation is introduced in deciding when two
occurrences of the same 4-tuple have the “same” form. The two tuples are placed in
the same class when their Root Mean Square Deviation (RMSD) is less than or
equal to a given threshold (rmsd thr); this threshold is currently set to 1.0A˚.
We developed a C program, called tuple generator, that creates a set of facts
of the form:
tuple( [g1, g2, g3, g4],
[X1α, Y1α, Z1α, X2α, Y2α, Z2α, X3α, Y3α, Z3α, X4α, Y4α, Z4α],
g2-centroid-list, g3-centroid-list,</p>
      <p>FREQ, ID, PID)
where [g1, g2, g3, g4] ∈ {0, . . . , 8}4 identifies the class of each amino acid, X1α, . . . , Z4α
are the coordinates of the 4 Cα atoms of the 4-tuple,7 FREQ ∈ {0, . . . , 1000} is a
frequency factor of the template w.r.t. all occurrences of the 4-tuple g1, . . . , g4 in
6 Note that our program tuple generator, developed to extract the desired
information, can work on any given set of known proteins.
7 Without loss of generality, we set (X1α, Y1α, Z1α) = (0, 0, 0).
the set top-500, ID is a unique identifier for this fact, and PID is the first protein
found containing this template; this last piece of information will be printed in
the file we produce as output of the computation, in order to allow one to recover
the source of a fragment used for the prediction.</p>
      <p>As discussed in Sect. 2.1, we model the position of the centroid of the side
chain of every amino acid. gi ∈ {0, . . . , 8} is a representative of the class of amino
acids γ−1(gi) (e.g., γ−1(2) = {ARG, GLU, GLN, LYS}—see Fig. 1). For each
amino acid a ∈ γ−1(gi), we compute the position of the centroid corresponding to
the positions X1α, . . . , Z4α of the Cα atoms, and add it to the gi-centroid-list.
Let us observe that we do not add the position of the first and last centroid in
the 4-tuples. As a result, at the end of the computation, only the centroid of the
first and the last amino acids of the entire protein will be not set; these can be
assigned using a straightforward post-processing step.</p>
      <p>It is unlikely that a 4-tuple a1, . . . , a4 that does not appear in the considered
training set will occur in a real protein. Nevertheless, in order to handle these
cases, if [γ(a1), . . . , γ(a4)] has no statistics associated to it, we map it to the
special 4-tuple [−1, −1, −1, −1]. By default, we assign to this unknown tuple a
set of templates (currently 6, but the number can be easily increased) chosen
among the most common ones in the set of all known templates. Other special
4tuples are [−2, −2, −2, −2] and [−3, −3, −3, −3]; these are assigned to α-helices
and β-sheets every time a secondary structure constraint is enforced locally on
a region of the conformation.</p>
      <p>We also introduce an additional collection of facts, based on the predicate
next, which are used to relate pairs of tuple facts. The relation next(ID1, ID2, Mat)
holds if the last three amino acids of the sequence in the tuple fact identified
by ID1 are identical to the first three amino acids of the sequence in the tuple
fact ID2, and the corresponding Cα sequences are almost identical—in the sense
that the RMSD between them is at most rmsd thr. Mat is the rotation matrix
to align the 1–3 Cα atoms of the 4-tuple of ID2 with the 2–4 Cα atoms of the
4-tuple of ID1.
2.3</p>
      <p>
        Statistical energy
The energy function used in this work builds on two components: (1) a contact
potential for side chain and backbone contacts and (2) an energy component
for each backbone conformation based on backbone conformational preferences
observed in the database. The first component uses the table of contact energies
described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This set of energies has been shown to be accurate, and it is
the result of filtering the dataset to allow accurate statistical analysis, and a
consequent fine tuning of the contact definition. Since not all side chain atoms
are represented, it is not possible to directly use the definition of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; therefore,
we retain the table of contact energies of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], but we adapt the contact definition
to the side chain centroid. In particular, we use the value 3.2A˚ as the shortest
distance between two centroids, and we set the contact distance between two
centroids to be less than or equal to the sum of their radii. Larger distances
provide a contribution that decays quadratically with the distance.
      </p>
      <p>The radius of each centroid is computed as the centroid’s average distance
from the Cα atom. No contact potential is assigned to the side chain’s Cα, as
the side chain definition already includes the Cα carbon. The energy assigned
to the contact between backbone atoms represented by Cα (not included in the
side chain definition) is the same energy assigned to ASN-ASN contacts, which
involve mainly similar chemical groups contacts. The torsional angle defined by
four consecutive Cα atoms is assigned an energy value defined by the potential of
the mean force derived by the distribution of the corresponding torsional angle
in the PDB. The procedure has been thoroughly described in [11].
3</p>
    </sec>
    <sec id="sec-3">
      <title>Modeling</title>
      <p>In this section, we describe the modeling of the problem of fragments
assembly using constraints over finite domains—specifically, the type of constraints
provided in CLP (F D). The input is a list Primary of n amino acids. We will
denote with ai the ith element of the primary sequence. We also allow PDB
identifiers as inputs; in this case, the primary structure of the protein is retrieved
from the PDB. Two lists of variables are generated. The first list (Code)
contains n − 3 variables. The i-th variable Ci of Code corresponds to the 4-tuple
(γ(ai), . . . , γ(ai+1), γ(ai+2), γ(ai+3)). The possible values for Ci are the IDs of
the facts of the form</p>
      <p>tuple([γ(ai), γ(ai+1), γ(ai+2), γ(ai+3)], , , Freq, ID, ).</p>
      <p>This set is ordered using the frequency information Freq in decreasing order,
and stored in a variable ListDomi.</p>
      <p>The next information is used to impose constraints between Ci and Ci+1.
Using the combinatorial constraint table, we allow only pairs of consecutive
values supported by the next predicate. Recall that, for each allowed combination
of values, the next predicate returns the rotation matrix Mi,i+1, which provides
the relative rotation when the two fragments are best fit.</p>
      <p>The second list of variables (Tertiary) contains two lists of 3n variables:
Xiα, Yiα, Ziα (resp., XiC, YiC , ZiC) denoting the 3D position of the Cα atoms
(resp., of the centroids). These variables have integer values (representing a
precision of 10−2˚A).</p>
      <p>In order to correlate Code variables and Tertiary variables, consecutive
4tuples must be constrained. Let us focus on the Cα part; consider two consecutive
tuples:
• ti = ai, ai+1, ai+2, ai+3 with code variable Ci, and
• ti+1 = ai+1, ai+2, ai+3, ai+4, with code variable Ci+1 and local coordinates</p>
      <p>V 1, V 2, V 3, V 4.
ti+1 is rotated as to best overlap the points in common with ti, and it is placed
so that the last point in ti+1 is at 3.8A˚ from the last point in ti.</p>
      <p>Let Xiα, Yiα, Ziα, . . . , Xiα+4, Yiα+4, Ziα+4 be the variables for the coordinates of
these Cα atoms, stored in the list Tertiary (Fig. 4, where Pi = (Xiα, Yiα, Ziα)).
The constraint introduced rotates and translates the template ti+1 from the
reference of Ci (represented by the orthonormal basis matrix Ri) according to the
rotation matrix Mi,i+1 to the new reference Ri+1 = Ri × Mi,i+1. Moreover, when
placing the template ti+1, the constraint affects only the coordinates of ai+4,
since the other variables are assigned by the application of the same constraint
for templates tj , j &lt; i + 1. The constraint shifts the rotated version of ti+1
so that it overlaps the third point V 3 with (Xiα+3, Yiα+3, Ziα+3). Formally, let
V rk = Ri+1 × V k, with k ∈ {1 . . . 4}, be the rotated 4-tuple corresponding to
Ci+1. The shift vector s = (Xiα+3, Yiα+3, Ziα+3) − V r3 is used to constrain the
position of ai+4 as follows:</p>
      <p>(Xiα+4, Yiα+4, Ziα+4) = s + Ri+1 × V 4.</p>
      <p>Note that the 3.8A˚ distance between consecutive amino acids (i.e., ai+3 and
ai+4) is preserved, and this constraint allows us to place templates without
requiring an expensive RMSD fit among overlapping fragments during the search.
Moreover, during a leftmost search, as soon as the variable Ci is assigned, also
the coordinates (Xiα+3, Yiα+3, Ziα+3) are uniquely determined.</p>
      <p>Matrix and vector products are handled by FD variables and constraints,
using a factor of 1000 for storing and handling rotation matrices.</p>
      <p>For the sake of simplicity, we omit the formal description of the constraints
associated to the centroids. The centroids’ positions are determined similarly,
e.g., as soon as the positions of the Cα atoms are determined, the centroid
positions are rotated and shifted accordingly.</p>
      <p>The X1α, Y1α, Z1α, . . . , Xnα, Ynα, Znα part of the Tertiary list relative to the
position of the Cα atoms, is also required to satisfy a constraint which
guarantees the all distant property [8]: the Cα atoms of each pair of non-consecutive
amino acids must be distant at least D = 3.2A˚. This is expressed by the
constraint:</p>
      <p>(Xiα − Xjα)2 + (Yiα − Yjα)2 + (Ziα − Zjα)2 ≥ D2
for all i ∈ {1, . . . , n − 2} and j ∈ {i + 2, . . . , n}. Similar constraints are imposed
between pairs of Cα and centroids as well as pairs of centroids. In the latter case,
in order to account for the differences in volume of each possible side chain, we
determine minimal distances that depend on the specific type of amino acid
considered.</p>
      <p>
        Other constraints are added to guide the search. We allow the user to set
a diameter parameter that is used to bound the maximum distance between
different Cα atoms (i.e., the diameter of the protein). As we argued in earlier
work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a good diameter value is 5.68 n0.38 ˚A. We impose this constraint to all
different pairs of Cα atoms.
The native structure of a protein is largely composed of some recurrent local
structures. These structures, α-helices and β-sheets, can be predicted with good
accuracy using efficient techniques, such as neural networks, or recognized using
other techniques (e.g., analysis of density maps from electron microscopy). Being
based on frequency analysis, our tool is able to discover the majority of secondary
structures. On the other hand, a-priori knowledge of these structures allows us
to remove several non-deterministic choices during computation. Therefore, we
allow knowledge of secondary structures to be provided as part of the input—e.g.,
information indicating that the amino acids i–j form an α-helix. In the processing
stage, for k ∈ {i, . . . , j−3}, a particular tuple [−2, −2, −2, −2] is assigned instead
of the tuple [γ(ak), . . . , γ(ak+3)]. This tuple has a unique template which is
associated to an α-helix, built in a standard way using a bend angle of 93.8
degrees and a torsional angle of 52.3 degrees. Moreover, a list of the possible
positions for the centroids of the 20 amino acids is retrieved. Since the domains
for these Ck’s are singletons, as soon as Ci is considered for value assignment,
all the points of the helix are deterministically computed. The situation in the
case of β-strands is analogous.
      </p>
      <p>A variation of this technique can be used if larger and more complex
substructures are known. Basically, even keeping just 4-tuples as internal data structures,
we can easily deal with tuples of arbitrary lengths. Automated manipulation of
arbitrary complex structures is subject of future work.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Searching</title>
      <p>The search is guided by the instantiation of the Ci variables. These variables
are instantiated in leftmost-first order; in turn, the values in their domains are
attempted starting with the most probable value first. We observed that a
firstfail strategy does not speed up the search, probably due to the weak propagation
of the matrix product constraints. As described in Section 2.3, we associate an
energy to each computed structure. The energy value is an FD constraint that
links coordinates variables and amino acids. Given the model of the problem,
this kind of constraint is not able to provide effective bounds for pruning the
search space when searching for optimal solutions. As future work, we plan to
investigate specific propagators, since the torsional energy contribution could be
exploited for exact bounds estimations.</p>
      <p>Each computed structure is saved in pdb format. This is a standard format
for proteins (detailed in the PDB repository) that can be processed by most
protein viewers (e.g., Rasmol, ViewerLite, JMol).
4.1</p>
      <p>Large Neighboring Search
In order to further reduce the time to search for solutions, we have developed
a logic programming implementation of Large Neighboring Search (LNS) [15].
LNS is a form of local search, where the search for the successive solutions is
performed by exploring a “large” neighborhood. The traditional basic move used
in local search, where the values of a small number of variables are changed, is
replaced by a more general move, where a large number of variables is allowed
to change, and these variables are subject to constraints. The basic LNS routine
is the following:
1. Generate an initial solution (e.g., using the standard CLP labeling
procedure).
2. Randomly select a subset of the variables of the problem that are admissible
for changes, and assign the previous values to the other variables.
3. Using again standard labeling, look for an assignment of the selected
variables that improve the energy/cost function (or look for an assignment that
ensures the best improvement possible of the energy/cost function). In any
case, go back to step 2.</p>
      <p>A timeout mechanism is typically adopted to terminate the cycle between steps
2 and 3. For example, the procedure is terminated if either a general time-out
occurs or k iterations are performed without any improvement in the quality of
the solution. In these cases, the best solution found is returned.</p>
      <p>This simple procedure can be improved by occasionally allowing a worsening
move—this is important to enable the procedure to abandon a local minimum.
The process can be implemented by introducing a random variable; whenever
such variable is assigned a certain value (with a low probability), then an
alternative move which worsens the quality of the solution is performed; otherwise
we continue using the scheme described earlier.</p>
      <p>The scheme has been implemented in CLP (F D). Even though the
implementation has been developed to meet the needs of the protein folding problem,
the scheme can be adapted with minimal changes to address the needs of other
constraint optimization problems coded in CLP (F D). The logical variables of
CLP (F D), being single-assignment, are not suitable to the process of modifying
a solution as requested by LNS. Therefore, a specific combination of
backtracking, assertions and reassignment procedures are enforced, in order to reset only
the assignments to the CSP (while the constraints are maintained) and to
reassign previous values to the variables that are not going to be changed. The
remaining variables can assume any assignment compatible with the constraints.</p>
      <p>The implementation we developed avoids these repetitions, by using
extralogical features of Prolog to record intermediate solutions in the Prolog database
(using the assert/retract predicates). The loss of declarativity is balanced by
enhanced performance in the implementation of LNS.
The prototype can search the first admissible solution (pf id(ID,Tertiary)),
where ID is a protein name included in the database (prot list.pl); the Primary
sequence is also admitted directly as input. It is possible to generate the first N
solutions and output them as distinct models in a single pdb file, or to create
different files for all the solutions found within a Timeout. Finally, LNS can be
activated by pf id lns(ID,Timeout). A version of the current CLP (F D)
im1: lns(ID, Time)
:2: constraint(ID,Code,Energy,Primary,Tertiary),
3: retractall(best(_,_,_)), assert(best(0,Primary,_)),
4: retractall(last_sol(_,_,_)), assert(last_sol(0,null,_)),
5: time_out(local(Code,Energy,Primary,Tertiary), Time,_),
6: fail.
7: lns(_,_)
:8: best(0,_,_,_),!,
9: write(’Insufficient time for the first solution’),nl.
10: lns(_,_)
:11: best(Energy,Primary,Tertiary),
12: print_results(Primary,Tertiary,Energy).
13: local(Code,Energy,
Primary,Tertiary):14: \+ another_sol(Code,Energy, Primary,Tertiary),
15: local(Code,Energy, Primary,Tertiary).
16: another_sol(Code, Energy, Primary, Tertiary)
:17: last_sol(Lim, LastSol,LastTer),
18: ( LastSol = null -&gt;
19: labeling(Code,Tertiary);
20: LastSol \= null -&gt;
21: random(1,11,Type),
22: (Type =&lt; 1 -&gt; Lim1 is 5*Lim//6, Energy #&gt; Lim1 ;
23: Type &gt; 1 -&gt; Energy #&lt; Lim ),
24: large_move(Code,LastSol,Tertiary,LastTer),
25: time_out( labeling(Code,Tertiary), 120000, Flag)),
26: (Flag == success -&gt; true ;
27: Flag == time_out -&gt;
28: write(’Warning: Time out in the labeling stage’),nl,
29: fail)),
30: !,
31: retract(last_sol(_,_,_)),
32: assert(last_sol(Energy,Code,Tertiary)),
33: best(Val,_,_),
34: (Val &gt; Energy -&gt;
35: retract(best(_,_,_)),
36: assert(best(Energy,Primary,Tertiary));
37: true),
38: fail.</p>
      <p>Fig. 5. An implementation of LNS in Prolog
plementation, along with a set of experimental tests, is available at www.dimi.
uniud.it/dovier/PF.</p>
      <p>The experimental tests have been performed on an AMD Opteron 2.2GHz
Linux Machine. Each computation was performed on a single processor using
SICStus Prolog 4.0.4. Some of the experimental results are reported in Table 1.
The execution times reported correspond to the time needed to compute the
best solution computed within the allowed execution time (s stands for seconds,
m for minutes).</p>
      <p>We consider 8 proteins and perform an exhaustive search of 6.6 hours. For
other 4 proteins, we perform both enumeration (left) and LNS search for 2 days
(right). Moreover we launch LNS for 2 hours (center).</p>
      <p>For every protein, we impose the secondary structure information as specified
in the corresponding PDB annotations. However, we wish to point out that the
12 proteins tested are not included in the top-500 Database from which we
extracted the 4-tuples. For each protein analyzed using LNS, we performed 4
independent runs, anticipated by a randomize statement, since the process relies
on random choices. We report the best result (in terms of RMSD) and the energy
associated.</p>
      <p>PID N Enumerate 6.6 hours</p>
      <p>RMSD Energy T (s)
1KVG 12 2.79 -59122 9.88
1EDP 17 3.04 -112755 73.00
1LE0 12 3.12 -45575 3.20
2GP8 40 5.96 -266819 5794.88</p>
      <p>PID N Enumerate 6.6 hours</p>
      <p>RMSD Energy T (s)
1LE3 16 3.90 -69017 218.79
1ENH 54 5.10 -467014 8553.92
1PG1 18 3.22 -109456 11.00
2K9D 44 6.99 -460877 1453.44</p>
      <p>
        The enumeration search is expected to perform better for smaller proteins,
where it is possible to explore a large fraction of the search space within the set
time limit. It is interesting to note that, for smaller chains, the RMSD w.r.t. the
native conformation in the PDB is rather small (ca. 3A˚); this indicates that the
best solutions found capture the fold of the chain, and the determined solutions
can be refined using molecular dynamics simulations, as done in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Moreover, the proteins 1ZDD, IVII, and 1AIL have been simulated both with
enumeration and LNS search strategy, in order to show that the latter is able
to improve both quality (i.e., RMSD) and computational time (up to 200 times
faster), thanks to the neighborhood exploration. In the case of 1ZDD and 1AIL,
where there are three helices packed together, the random choice of good pivot
moves effectively guides the folding towards the optimal solution.</p>
      <p>Finally, the protein 2IGD shows that the allowed time was not sufficient to
determine a sensible prediction, even though the LNS shows enhancements in
quality of the prediction, given the same amount of time. In this case, additional
work is needed to improve the choice of moves performed by LNS, in order to
explore the neighborhood in a more effective way. For example, moves that shift
and translate some rigid parts of the protein may improve the performance and
quality.</p>
      <p>We also note that the size of the set of fragments in use is sufficient to allow
reasonable solutions, while allowing tractable search spaces. Clearly, the use of
a more refined partitioning of the fragments (by reducing the rmsd thr) would
produce conformations that are structurally closer to the native state, at the
price of an increase in the size of the search space.</p>
      <p>In Figure 6, we depict the native state (on the left) and our best prediction
(on the right) for the protein 1ENH with the backbone and centroids. In the
middle, we show the RMSD overlap of the Cα atoms between the native
conformation (red/dark gray) and the prediction (green/light gray). The main features
are preserved and only the right loop that connects the two helices appears to
have moved significantly. This could be avoided by introducing a richer set of
alternative fragments in that area and thus allow a more relaxed placement of
the fragments.</p>
      <p>
        An important issue is that it is not obvious that the reduced representation
and the energy function used here are able to distinguish the native structure
from decoys constructed to minimize that energy function. The straightforward
comparison between RMSD and energy is not completely meaningful because the
native structure should be energy-minimized before energy comparison. This can
be noted by comparing the conformations obtained by enumeration and LNS.
The model in use improves w.r.t. the Cα model as presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However,
some fine tuning of the energy coefficients is still necessary in order to improve
the correlation, while preserving the overall constraint model. This will be a
further area of investigation. Our results show that the method can scale well
and that further speed-up may be obtained by considering larger fragments as
done by tools like Rosetta. Rosetta is in fact the state-of-the-art predictor tool
(e.g., the small protein 1ENH is predicted by Rosetta in less than one minute
with a RMSD of 4.2 A˚).
6
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work</title>
      <p>In this paper we presented the design and implementation of a constraint logic
programming tool to predict the native conformation of a protein, given its
primary structure. The methodology is based on a process of fragments assembly,
using templates retrieved from a protein database, that is clustered according
to shape similarity. We used templates based on sequences of 4 amino acids,
but longer sequences could be easily incorporated—and these could bring
statistically significant constraints describing conserved shapes. The constraint
solving process takes advantage of a large neighboring search strategy, which has
been shown to be particularly effective when dealing with longer proteins. The
preliminary experimental results confirm the strong potential for this fragment
assembly scheme.</p>
      <p>The proposed method has a significant advantage over schemes like Rosetta—
the use of CLP (F D) enables the simple addition of ad-hoc constraints and
experimentation with different local search moves and energy functions. As future
work, we plan to investigate, in particular, energy functions that better correlate
with RMSD w.r.t. the (known) native structures and the computed structures.
Other (possibly redundant) constraints and constraint propagation techniques
should be analyzed, possibly moving to the C++ solver Gecode.
Acknowledgments. The work is supported by GNCS-INdAM grant “Tecniche
innovative per la programmazione con vincoli in applicazioni strategiche”, PRIN
2007 “Computer simulations of Glutathione Peroxidases: insights into the
catalytic mechanism” and PRIN 2008 “Innovative multi-disciplinary approaches for
constraint and preference reasoning” grants.
8. A. Dal Palu`, A. Dovier, E. Pontelli. Computing Approximate Solutions of the
Protein Structure Determination Problem using Global Constraints on Discrete
Crystal Lattices. International Journal of Data Mining and Bioinformatics 4(1):1–
20, January 2010.
9. I. Dotu´ et al. Protein Structure Prediction with Large Neighborhood Constraint</p>
      <p>Programming Search. CP, LNCS 5202, pp. 82–96, 2008.
10. F. Fogolari et al. Modeling of polypeptide chains as C-α chains, C-α chains with
C-β, and C-α chains with ellipsoidal lateral chains. Biophysical Journal, 70:1183–
1197, 1996.
11. F. Fogolari et al. Scoring predictive models using a reduced representation of
proteins: model and energy definition. BMC Structural Biology, 7(15), 2007.
12. C. Levinthal. Are there pathways in protein folding? J. Chim. Phys. 65:44–45,
1968.
13. S. Lovell, et al. Structure validation by Cα geometry: φ, ψ and Cβ deviation.</p>
      <p>Proteins 50:437–450, 2003.
14. S. Raman, et al. Structure prediction for CASP8 with all-atom refinement using</p>
      <p>Rosetta. Proteins 77(S9):89–99, 2009.
15. P. Shaw. Using Constraint Programming and Local Search Methods to Solve</p>
      <p>Vehicle Routing Problems. Proc. of CP’98, LNCS 1520, pp. 417–431, 1998.
16. S. Wu, J. Skolnick, and Y. Zhang Ab initio modeling of small proteins by iterative
TASSER simulations. BMC Biology, 5(17), 2007.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Backofen</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Will</surname>
          </string-name>
          .
          <article-title>A Constraint-Based Approach to Fast and Exact Structure Prediction in 3-Dimensional Protein Models</article-title>
          .
          <source>Constraints</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Barahona</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Krippahl</surname>
          </string-name>
          .
          <article-title>Constraint Programming in Structural Bioinformatics</article-title>
          .
          <source>Constraints</source>
          <volume>13</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben-David</surname>
          </string-name>
          , et al.
          <article-title>Assessment of CASP8 structure predictions for template free targets</article-title>
          .
          <source>Proteins</source>
          <volume>77</volume>
          (
          <issue>S9</issue>
          ):
          <fpage>50</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Berrera</surname>
          </string-name>
          et al.
          <article-title>Amino acid empirical contact energy definitions for fold recognition in the space of contact maps</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>4</volume>
          (
          <issue>8</issue>
          ),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          et al.
          <article-title>On the complexity of protein folding</article-title>
          .
          <source>STOC</source>
          :
          <fpage>61</fpage>
          -
          <lpage>62</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dal</surname>
          </string-name>
          <string-name>
            <surname>Palu`</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Fogolari</surname>
          </string-name>
          .
          <article-title>Constraint logic programming approach to protein structure prediction</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>5</volume>
          (
          <issue>186</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dal</surname>
          </string-name>
          <string-name>
            <surname>Palu`</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Pontelli</surname>
          </string-name>
          .
          <article-title>A constraint solver for discrete lattices, its parallelization, and application to protein structure prediction</article-title>
          .
          <source>Software: Practice and Experience</source>
          <volume>37</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1405</fpage>
          -
          <lpage>1449</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>