=Paper=
{{Paper
|id=Vol-429/paper-2
|storemode=property
|title=Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations
|pdfUrl=https://ceur-ws.org/Vol-429/paper2.pdf
|volume=Vol-429
|dblpUrl=https://dblp.org/rec/conf/eccb/ReumersSR08
}}
==Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations==
Using structural bioinformatics to investigate the impact of
non synonymous SNPs and disease mutations: scope and
limitations
Joke Reumers1 , Joost Schymkowitz1 and Fréderic Rousseau∗1
1 Switch Laboratory, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
Email: Joke Reumers - joke.reumers@vub.ac.be; Joost Schymkowitz - joost.schymkowitz@vub.ac.be; Fréderic Rousseau∗ -
frederic.rousseau@vub.ac.be;
∗ Corresponding author
Abstract
Background: Linking structural effects of mutations to functional outcomes is a major issue in structural bioin-
formatics, and many tools and studies have shown that specific structural properties such as stability and residue
burial can be used to distinguish neutral variations and disease associated mutations.
Results: We have investigated 39 structural properties on a set of SNPs and disease mutations from the Uniprot
Knowledge Base that could be mapped on high quality crystal structures and show that none of these properties
can be used as a sole classification criterion to separate the two data sets. Furthermore, we have reviewed the
annotation process from mutation to result and identified the liabilities in each step.
Conclusions: Although excellent annotation results of various research groups underline the great potential of using
structural bioinformatics to investigate the mechanisms underlying disease, the interpretation of such annotations
cannot always be extrapolated to proteome wide variation studies. Difficulties for large-scale studies can be found
both on the technical level, i.e. the scarcity of data and the incompleteness of the structural tool suites, and on
the conceptual level, i.e. the correct interpretation of the results in a cellular context.
Background plete molecular phenotype may seem naive at first
glance, had it not been suggested that individual
The molecular phenotype of a coding non synony- properties such as protein stability, the accessibility
mous SNP or disease associated mutation describes of the amino acid substitution site, and the location
the functional and structural properties of a protein of variants in surface pockets are predictive deter-
that are affected by a single amino acid substitu- minants of the phenotypic effect of a variation [1–4].
tion [25]. In this study we want to address whether A comparative study of protein stability predictors
the concept of the in silico determined molecular by Blundell and co-workers demonstrated that al-
phenotype can be employed for large-scale classifica- though protein stability changes caused by mutation
tion of SNPs and disease mutations. The attempt to can be relatively accurately estimated in silico, these
classify a large set of mutations based on an incom-
1
predictions by themselves do not yield accuracy on structure quality are applied. Our standard restric-
large-scale classification between benign and disrup- tions on building high-confidence structural models
tive mutations [5–7]. using the FoldX force field are X-ray structures with
Furthermore, computational analyses rely heav- a resolution lower than 2.5 Å and sequence identity
ily on the quality of the data under scrutiny and higher than 80%. Applying these restrictions to the
the computational methods used to evaluate these Ensembl data results in a data set of 5416 nsSNPs
data. Before investigating 39 structural properties of (circa 4% of the data, Figure S1B).
proteins and amino acid substitutions for their pre-
dictive power regarding SNP classification, we have
investigated what major liabilities are encountered Predictability of structural properties
when implementing an structural approach to SNP The second issue for a large-scale structural bioinfor-
annotation and classification. The results are com- matics approach is the structural properties that are
pared with those achieved by the best performers predictable with state of the art tools: how well can
among the state-of-the-art tools. we describe the structural behaviour of a protein and
its mutants? Previous structural studies have iden-
tified protein stability, aggregation and misfolding
as determinants of correct functioning on the single
Results and Discussion
protein level [7,11,12]. Mutations affecting the func-
In this study we have identified the common issues tional sites of a protein, such as DNA, ligand and
that are encountered when performing large-scale protein interaction sites, are not considered within
analyses of structural properties of human coding this scope, but the investigation of these sites will
variation. The first issue concerns the availability of most certainly be of great importance to assess the
structural data for nsSNPs and disease mutations, impact of amino acid substitutions.
while the second involves the availability of compu- Tools have been developed that describe the
tational tools to predict structural properties. The structure and dynamics of a protein: stability, ag-
last issue concerns the quality of classification: are gregation, amyloidosis, and folding. We have used
the training and evaluation data sets used in the computational methods that are capable of assessing
analyses sufficient to extrapolate results for larger the effects of a mutation on protein stability (FoldX),
studies, and do the properties used have sufficient aggregation (Tango) and amyloidosis (Waltz). Al-
predictive power to separate the two data sets? though algorithms exist that can predict folding of
small single domain proteins (e.g. Rosetta [13],
FoldX [14], SimFold [15]), to date no computational
Structural coverage of human genetic variation method exists that can predict folding events on
Despite structural genomics projects, the gap be- large multi-domain proteins, or that is applicable in
tween sequence and structural information is still genome wide studies.
wide, and the coverage of variation data with struc- Although we have not investigated protein-
tural data is estimated to be as low as 14% [4]. We protein interactions in this study, we have included
have investigated the boundaries of structural cov- an analysis of the binding of proteins to molecular
erage by varying the quality requirements on the chaperones, as it is directly related to correct folding
structural model (Supplementary Figure S1A), the of the protein. The high abundance of chaperones in
sequence identity between query sequence and mod- the cell emphasises their crucial role in the cell [16],
elled structure (Figure S1B), the percentage of the but this is not reflected in the availability of compu-
wild type sequence covered by the structural model tational tools for chaperone binding. We have used
(Figure S1C), and the length of the alignment be- the only available tool, the Hsp70 binding predic-
tween query and target (Figure S1D). Without ap- tor Limbo [17], to assess chaperone binding variation
plying any restrictions, about 12% of all nsSNPs caused by amino acid alteration.
present in the Ensembl Variation Database (release
44) can be mapped on a structural model, in accor-
dance with the estimate cited previously. However, The predictive power of structural properties
this percentage is valid only when no restrictions Following the recommendations of Care et al [18],
regarding sequence identity, sequence coverage or we have used the SwissProt annotated disease and
2
polymorphism data (SwissProt Variation Index re- ied between the minimal and maximal values mea-
lease 52) as the evaluation data for our analyses. sure for the specific property, and the true and false
Mapping of these variants on high quality structural positive rate, and the Matthews correlation coeffi-
models (X-ray structures with resolution ≤ 2.5Å, se- cient (MCC) were calculated for each cut-off value.
quence identity with the model above 80%) yielded Table 3 lists the data for both the best MCC and
a data set of 240 positive (disease-associated) muta- the MCC90, i.e. the coefficient that is measured at
tions and 400 negative variations (neutral nsSNPs) high specificity (true negative rate = 90%). The
in 98 proteins. To ensure that the analyses are com- corresponding ROC curves for these analyses can be
parable, we applied the sequence based predictors to found in Supplementary Figure S1.
the same small data set as the predictors that use The same strategy was then applied to predicted
3D structures or structural models. values of structural differences between mutant and
Before we evaluated the discriminative power of wild type proteins (24 properties). Statistics were
the individual structural parameters, we wanted to calculated for stability and entropy parameters, as
assess whether our data showed distinguishable pat- well as for differences concerning protein aggrega-
terns for three important parameters. The first two tion, amyloidosis and chaperone binding (Table 4,
criteria, stability difference and the degree of burial Supplementary Figure S2).
of the mutation site, have previously been identi- The results obtained from these detailed analy-
fied as providing information about the severity of ses are unanimous: none of the parameters evaluated
a mutation [4, 19]. The third criterion is difference can be used to separate the data. All MCC values
in aggregation propensity, which has been cited as are close to zero, and thus the predictions are no bet-
likely to be an important factor in disease suscepti- ter than a random predictor would perform on the
bility [12, 20] but thus far has not been applied in a data. The high accuracy of FoldX for stability esti-
proteome wide mutation analysis. mation has been proven in various studies [6,9,10], so
Figure 1 shows the distributions for the stabil- we have high confidence in our stability estimations.
ity differences (A) and differences in aggregation In accordance with the analyses of [7], we find that
propensity (B) between wild type and variant pro- high stability differences alone are no sufficient crite-
teins, and the burial of the mutation site (C). The rion to distinguish deleterious mutations and neutral
first observation of both the stability and the ag- variation. These results show that the dominant ef-
gregation analysis is that the observed changes are fect of for instance stability that was proposed in
not discrete but follow a smooth distribution from earlier large-scale studies [4, 22] can not be always
negative to positive change. Second, there are no- generalised for other data.
ticeable differences between SNPs and disease muta- The fact that none of the properties representing
tions, but they cannot be distinguished by a simple conformational differences between wild type and
cut-off value on the output, as there is large over- variant protein contain enough information to distin-
lap between the distributions. This is confirmed by guish neutral and deleterious variation implies that
the P-values obtained from paired student t-tests, large-scale classification based on singular structural
which are 0.96 for the stability distributions, 0.99 properties is not feasible and requires a better un-
for the aggregation distributions, and 0.99 for the derstanding of how the complex interplay between
burial distributions, respectively. For the stability biophysical and biochemical properties of a protein
distributions, we see that disease mutations are gen- conspire to different tolerance for mutations in dif-
erally more destabilising than SNPs, but their distri- ferent proteins.
butions overlap largely. A similar analysis has been Recent studies that combine structural and evo-
performed on SwissProt variants using the Site Di- lutionary information using machine learning tech-
rected Mutator stability predictor [7], and the distri- niques are able to classify relatively large data sets
butions of stability differences of disease mutations obtained for the SwissProt database successfully
and neutral variations are similar to our findings. (summarised in Table S2). Machine learning ap-
In a first series of properties to test as classifiers, proaches suggest that data integration is indeed the
we have investigated 15 properties of the amino acid way forward, but the creation of this black box style
substitution site that contribute to the assessment of of classifier does not offer insight into the biological
the effect of the mutation using the FoldX algorithm processes. In the same way that using evolutionary
(Table 3). Cut off values were generated that var- information to classify SNPs obscures the how and
3
why a specific mutation is deleterious, using black tural bioinformatics tools that were proposed in the
box machine learning methods will not teach us what SNPeffect toolsuite [26] for their ability to act as a
the underlying reason of disease is. Although know- binary classifier for deleterious and neutral SNPs.
ing that an amino acid is critical for correct function Neither of the individual properties that were ex-
is of course useful, in a structural bioinformatics ap- amined could serve this purpose. Because several
proach the focus is more on the molecular mecha- approaches were able to classify similar data sets as
nism underlying disease. the one we have used, we applied the most used evo-
A simple combination of the SNPeffect structural lutionary method, SIFT [23], to our data set. As it
bioinformatics toolsuite on our evaluation data set was not able to classify our data set accurately, we
showed that in our case, at least a linear combina- argued that generalisation of the results presented by
tion of these methods is not sufficient to classify the the state of the art classifiers might be an important
data (TPR = 0.73, TNR= 0.27, MCC=0). A large issue. We illustrated this problem with the variabil-
part of the polymorphism data is predicted to have ity of performance of SIFT on 8 different data sets
deleterious effect. To assess the “predictiveness” of used in various analyses.
our data set, we applied the well-established evolu-
tionary method SIFT [24] to our data and found that
SIFT was also not able to classify effectively. In fact
the results were even worse than our naive classifier From these analyses we concluded that strict
(TPR=0.69, TNR=0.21, MCC=-0.12). classification of SNPs is not feasible at the time, both
As an illustration of the influence of the data because there are still many technical difficulties to
set used for evaluation on the performance of a pre- overcome, and because the biological interpretation
dictor, we list the results for the variation in per- of the molecular phenotype in relation to a disease
formance of SNP classification of SIFT, that uses phenotype is a complex matter. Even at the single
evolutionary information to label SNPs (Supplemen- molecule level, we cannot assess how tolerant a spe-
tary Table S3). The Matthews correlation coefficient cific protein is to structural variation. The inherent
varies between -0.12 on our data set over 0.25 on hu- rigidity of a protein might influence the change in
man mutagenesis data, up to 0.59 on the HIV-1 pro- stability that is allowed before severe conformational
tease mutagenesis set in the original SIFT paper [24]. changes are introduced. Furthermore, on the cellu-
This is yet another informative example on how cru- lar level biological interpretation is even harder: we
cial the choice of training and test data are to build can not predict the role of the protein quality control
and evaluate predictors: generalisation of results is system plays in this tolerance level, not all interac-
only possible when the training data are expressive tions are described at the molecular level, and much
enough to represent the entire feature space. more. Even if we can predict the molecular effect
accurately, this might not necessarily result in a dis-
ease phenotype because of functional redundancy of
the protein.
Conclusions
The concept of using the molecular phenotypic ef-
fect of a nsSNP to assess its effect on the structure
and function of the protein it alters was first intro- However, not being able to classify human varia-
duced by Bork and co-workers [25]. The question tion into disease mutations and neutral or beneficial
has been raised to how much of this molecular phe- variation does not mean that this approach or the
notype is necessary to evaluate the contribution of methods developed are useless. By using high qual-
a SNP to a disease phenotype: are there singular ity bioinformatics tools, we can select from a large
dominant properties that determine the impairment pool of variations the candidates that are interesting
of structure and function, or do we need to consider for detailed investigation. This in itself is a valuable
the full ensemble of molecular properties to interpret contribution, because the amount of variation data
the impact of the SNP? Other research groups have available is too massive to be investigated experi-
proposed that single properties such as stability [4] mentally. In silico analyses can and will be used
and solvent accessibility [1] can be used to classify successfully as an addition to in vitro and in vivo
SNPs. We have examined all the individual struc- studies.
4
Methods References
Assembly of data sets 1. Chasman D, Adams RM: Predicting the func-
tional consequences of non-synonymous single nu-
Statistics on the structural coverage and validation cleotide polymorphisms: Structure-based assess-
status of human non synonymous coding SNPs were ment of amino acid variation. J Mol Biol 2001,
performed on data from the Ensembl human vari- 307(2):683–706.
ation database release 44, containing 12.2 million 2. Ferrer-Costa C, Orozco M, de la Cruz X: Characteriza-
SNPs, of which 133698 cause an amino acid vari- tion of disease-associated single amino acid poly-
morphisms in terms of sequence and structure
ation in a known transcript. The mapping of SNPs properties. J Mol Biol 2002, 315(4):771–786.
on protein structures was evaluated using the “ensp- 3. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif
pdbmapping” DAS service provided by the SPICE S, Liang J: Structural location of disease-associated
server [27]. Positive and negative data sets for single-nucleotide polymorphisms. J Mol Biol 2003,
the evaluation of SNP classification were designed 327(5):1021–1030.
with data from the SwissProt variation index [28] in 4. Yue P, Li Z, Moult J: Loss of protein structure sta-
the UniProt knowledge base (version 52.0, March bility as a major causative factor in monogenic
disease. J Mol Biol 2005, 353(2):459–473.
2007, [29]) that were mapped onto known PDB
5. Worth CL, Burke DF, Blundell TL: Estimating the ef-
structures and high quality homologs thereof. The
fects of single nucleotide polymorphisms on pro-
quality criteria described in the results section (mod- tein structure: how good are we at identifying
els with resolution of 3 Åor higher, sequence iden- likely disease associated mutations? In Proceedings
tity of 80% or more) lead to structural models of of Molecular Interactions - Bringing Chemistry to Life
2006.
400 SNPs (negative) and 240 disease associated mu-
tations (positive). 6. Burke DF, Worth CL, Priego EM, Cheng T, Smink LJ,
Todd JA, Blundell TL: Genome bioinformatic anal-
ysis of nonsynonymous SNPs. BMC Bioinformatics
2007, 8:301.
Structural bioinformatics tools 7. Worth CL, Bickerton GRJ, Schreyer A, Forman JR,
We have used the FoldX force field [33] for all mu- Cheng TMK, Lee S, Gong S, Burke DF, Blundell TL:
A structural bioinformatics approach to the anal-
tant properties regarding structural location, protein ysis of nonsynonymous single nucleotide polymor-
stability and its various components, the Tango [34] phisms (nsSNPs) and their relation to disease. J
and Waltz [35, submitted] algorithms to assess the Bioinform Comput Biol 2007, 5(6):1297–1318.
propensity for aggregation of wild type and variant 8. Boutselakis H, Dimitropoulos D, Fillon J, Golovin A,
proteins, and the Limbo algorithm [17, submitted] to Henrick K, Hussain A, Ionides J, John M, Keller PA,
Krissinel E, McNeil P, Naim A, Newman R, Oldfield T,
evaluate the chaperone-binding properties of amino Pineda J, Rachedi A, Copeland J, Sitnov A, Sobhany
acid sequences. A novel tool developed by Lenaerts S, Suarez-Uruena A, Swaminathan J, Tagari M, Tate J,
et al (unpublished) was used to estimate the en- Tromm S, Velankar S, Vranken W: E-MSD: the Eu-
tropy of a specific amino acid site in a high-resolution ropean Bioinformatics Institute Macromolecular
Structure Database. Nucleic Acids Res 2003, 31:458–
structure. Detailed descriptions of these five tools 462.
can be found in the Supplementary Material.
9. Guerois R, Nielsen JE, Serrano L: Predicting changes
in the stability of proteins and protein complexes:
A study of more than 1000 mutations. J Mol Biol
2002, 320(2):369–387.
Authors contributions
10. Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik
Conceived and designed the experiments: JR JS FR. DS: The stability effects of protein mutations ap-
Performed the experiments: JR. Analysed the data: pear to be universally distributed. J Mol Biol 2007,
JR JS FR. Wrote the paper: JR. 369(5):1318–1332.
11. Steward RE, MacArthur MW, Laskowski RA, Thornton
JM: Molecular basis of inherited diseases: a struc-
tural perspective. Trends Genet 2003, 19(9):505–513.
Acknowledgements 12. DePristo M, Weinreich D, Hartl D: Missense meander-
Joke Reumers was supported by a grant from the Federal ings in sequence space: A biophysical view of pro-
Research Office (FWO, IUAP P6/43), Belgium, and the tein evolution. Nature Reviews Genetics 2005, AOP.
Institute for the encouragement of Scientific Research 13. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio
and Innovation of Brussels (ISRIB), Belgium. protein structure prediction of CASP III targets
using ROSETTA. Proteins 1999, Suppl 3:171–176.
5
14. Serrano L, Guerois R: Fold-X: An algorithm to pre- 26. Reumers J, Conde L, Medina I, Maurer-Stroh S,
dict and engineer folding pathways. Abstr Pap Am Van Durme J, Dopazo J, Rousseau F, Schymkowitz J:
Chem Soc 2001, 221:U395–U395. Joint annotation of coding and non-coding single
15. Fujitsuka Y, Chikenji G, Takada S: SimFold energy nucleotide polymorphisms and mutations in the
function for de novo protein structure prediction: SNPeffect and PupaSuite databases. Nucleic Acids
consensus with Rosetta. Proteins 2006, 62(2):381– Res 2008, 36(Database issue):D825–9.
398.
27. Prlic A, Down TA, Hubbard TJ: Adding some SPICE
16. Soti C, Csermely P: Protein stress and stress pro- to DAS. Bioinformatics 2005, 21 Suppl 2:ii40–1.
teins: implications in aging and disease. J Biosci
2007, 32(3):511–515. 28. Yip YL, Famiglietti M, Gos A, Duek PD, David FPA,
17. Van Durme J, Maurer-Stroh S, Wilkinson H, Rousseau Gateau A, Bairoch A: Annotating single amino acid
F, Schymkowitz J: Accurate prediction of the se- polymorphisms in the UniProt/Swiss-Prot knowl-
quence determinants of DnaK-peptide binding via edgebase. Hum Mutat 2008, 29(3):361–366.
a method that integrates homology modelling and
29. UniProt Consortium: The Universal Protein
experimental data. Submitted 2007.
Resource (UniProt). Nucleic Acids Res 2007,
18. Care MA, Needham CJ, Bulpitt AJ, Westhead DR: Dele- 35(Database issue):D193–7.
terious SNP prediction: be mindful of your train-
ing data! Bioinformatics 2007, 23(6):664–672. 30. Zweig MH, Campbell G: Receiver-operating charac-
19. Ramensky V, Bork P, Sunyaev S: Human non- teristic (ROC) plots: a fundamental evaluation
synonymous SNPs: server and survey. Nucleic Acid tool in clinical medicine. Clin Chem 1993, 39(4):561–
Res 2002, 30(17):3894–3900. 577.
20. Worth CL, Blundell TL: Estimating the effects of 31. Matthews BW: Comparison of the predicted
SNPs on protein structure: loss of protein inter- and observed secondary structure of T4 phage
actions and stability as indicators of mis-function lysozyme. Biochim Biophys Acta 1975, 405(2):442–451.
and disease-association. Curr Top Biochem Res 2008,
In press. 32. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H:
21. Stitziel NO, Binkowski TA, Tseng YY, Kasif S, Liang Assessing the accuracy of prediction algorithms
J: TopoSNP: a topographic database of non- for classification: an overview. Bioinformatics 2000,
synonymous single nucleotide polymorphisms 16(5):412–424.
with and without known disease association. Nu-
33. Schymkowitz JWH, Rousseau F, Martins IC, Ferkinghoff-
cleic Acid Res 2004, 32:D520–D522.
Borg J, Stricher F, Serrano L: Prediction of water and
22. Yue P, Melamud E, Moult J: SNPs3D: candidate gene metal binding sites and their affinities by using
and SNP selection for association studies. BMC the Fold-X force field. Proc Natl Acad Sci USA 2005,
Bioinformatics 2006, 7:166. 102(29):10147–10152.
23. Ng PC, Henikoff S: SIFT: predicting amino acid
changes that affect protein function. Nucleic Acid 34. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J,
Res 2003, 31(13):3812–3814. Serrano L: Prediction of sequence-dependent and
mutational effects on the aggregation of peptides
24. Ng PC, Henikoff S: Predicting deleterious amino
and proteins. Nat Biotechnol 2004, 22(10):1302–1306.
acid substitutions. Genome Res 2001, 11(5):863–874.
25. Sunyaev S, Lathe Wr, Bork P: Integration of genome 35. Maurer-Stroh S, Kuemmerer N, Lopez de la Paz M, Mar-
data and protein structures: prediction of protein tins I, Reumers J, Serrano L, Rousseau F, Schymkowitz
folds, protein interactions and “molecular pheno- J: Accurate prediction of sequence determinants
types” of single nucleotide polymorphisms. Curr of amyloid formation using the Waltz algorithm.
Opin Struct Biol 2001, 11:125–130. Submitted 2007.
Figures
Figure 1 - Distributions for the major structural criteria in the disease and polymorphism datasets.
White = disease mutations, grey = polymorphisms. A. Stability difference as calculated by the FoldX force
field (in kcal.mol−1 ). B. Difference in aggregation propensity as calculated by the Tango algorithm. Values
close to neutral changes (in the range [−50, 50]) are left out for display purposes. C. Distribution of degree
of burial of the amino acid substitution site.
6
Tables
Table 1 - Summary of structural coverage of SNP data.
Several criteria resulting from the above analyses are applied to assess the structural coverage and reliability
of that coverage of human SNPs in the Ensembl database, as well as the overlap of the structural coverage
with quality parameters for the validation and frequency status of the polymorphism data.
Properties # SNPs % SNPs
nsSNPs covered by high quality structural data
No additional criteria 9877 7.4
Sequence coverage>80 or alignment length> 100 8238 6.2
Sequence identity>80 5416 4.1
Sequence coverage>80 or alignment length> 100, 5318 4.0
and sequence identity>80
Highly reliable nsSNPs covered by high quality structural data
Doublehit validation status, MAF>0.01 680 0.51
Doublehit validation status, MAF>0.01, sequence 229 0.17
identity>80
Doublehit validation status, MAF>0.01, sequence 446 0.33
coverage>80 or alignment length> 100
Doublehit validation status, MAF>0.01, sequence 209 0.16
coverage>80 or alignment length> 100, and se-
quence identity>80
Table 2 - Predictive power of structural properties of the modeled variant proteins.
FoldX was used to evaluate both the overall stability contribution of the amino acid substitution site in
the modeled structure and the various factors involved in this stability. The entropy of the variant amino
acid was calculated using a sampling strategy to assess the possible side chain conformations allowed at the
substitution site. Both stability and entropy were calculated for all mutations and for a subset of buried
mutations (side chain burial < 0.5) and surface mutations (side chain burial ≥ 0.5). Corresponding ROC
curves are shown in Supplementary Figure S2.
Table 1
Property FPR TPR Best MCC Threshold MCC90
FoldX energy evaluation
Overall stability of residue 14 33 0.22 1.61 0.19
Backbone H bond 32 72 0.40 -1.05 0.22
Sidechain H bond 99 100 0.07 -1.76 <0
Electrostatics 86 93 0.11 -0.10 -0.01
Entropy side chain 59 80 0.22 0.32 0.05
Entropy main chain 13 27 0.18 1.96 0.10
Van der Waals contribution 25 47 0.23 -0.98 0.15
Solvation hydrophobic 10 22 0.16 -0.6 0.16
Solvation polar 42 70 0.28 1.5 0.06
Van der Waals clash 18 33 0.17 0.22 0.15
Side chain burial 51 67 0.16 0.43 -0.1
Main chain burial 59 83 0.26 0.73 0.05
Entropy by sampling of possible side chain conformations
Entropy side chain 72 84 0.15 0.93 0
7
Table 3 - Predictive power of the differences between wild type and variant proteins for different
structural properties.
FoldX was used to evaluate both the overall stability difference between wild type and variant structure, and
the constituting contributions leading to this stability difference. The entropy difference caused by the amino
acid substitution was calculated using a sampling strategy to assess the possible side chain conformations
allowed at the substitution site. Both stability and entropy difference were calculated for all mutations and
for a subset of buried mutations (side chain burial < 0.5) and surface mutations (side chain burial ≥ 0.5).
Corresponding ROC curves are shown in Supplementary Figure S3.
Property FPR TPR Best Threshold MCC90
MCC
FoldX energy evaluation
Overall stability difference 73 85 0.15 -0.45 0.14
Overall stability diff. (surface) 0 8 0.2 3.1 0.13
Overall stability diff. (buried) 21 44 0.25 2.64 0.12
Backbone clash 91 99 0.18 -1.00 -0.02
Backbone H bond 59 83 0.26 -0.025 0.06
Sidechain H bond 79 92 0.18 -0.13 -0.14
Electrostatics 6 18 0.18 0.15 0.16
Entropy main chain 6 18 0.18 0.15 0.04
Entropy side chain 64 74 0.11 -0.125 -0.05
Solvation hydrophobic 57 75 0.19 -0.15 -0.03
Solvation polar 22 36 0.15 0.20 -0.05
Torsion clash 1 3 0.07 1.00 -0.05
Van der Waals contribution 7 14 0.11 0.89 0.10
Van der Waals clash 98 100 0.10 -1.60 0.02
Entropy difference by sampling of possible side chain conformations
FoldX entropy difference 85 92 0.11 -1.85 -0.02
FoldX entropy diff. (buried) 96 100 0.14 -2.70 -0.05
FoldX entropy diff. (surface) 37 57 0.20 -0.10 0.02
Aggregation properties
Tango 1 3 0.07 39.9 0
Tango (positive, more aggr.) 14 22 0.10 16.37 0
Tango (negative, less aggr.) 69 78 0.10 -8.00 0
Waltz 0 1 0.07 748.97 0
Waltz (positive, more aggr.) 16 21 0.06 677.15 0
Waltz (negative, less aggr.) 99 100 0.07 -2412.78 0
Limbo 17 33 0.18 5.45 0
Additional Files
Figure 1 – figure1.pdf
Additional file 2 — supplementary.pdf
Several of the less critical figures and tables are added as supplementary material, together with detailed
descriptions of the structural bioinformatics tools used.
8
Using structural bioinformatics to investigate the impact of
non synonymous SNPs and disease mutations: scope and
limitations
Supplementary Material
Joke Reumers1 , Joost Schymkowitz1 and Fréderic Rousseau∗1
1 Switch Laboratory, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
Email: Joke Reumers - joke.reumers@vub.ac.be; Joost Schymkowitz - joost.schymkowitz@vub.ac.be; Fréderic Rousseau∗ -
frederic.rousseau@vub.ac.be;
∗ Corresponding author
Methods Entropy calculations based on side chain sampling
In addition to the entropy calculations intrinsic to
Structural bioinformatics tools the FoldX force field, we use a novel method based
on extensive sampling of side chain conformations
as developed by Lenaerts et al. (unpublished). The
FoldX
sampling method produces for each side chain the
probability (P (X)) of finding the residue’s side chain
The FoldX force field was developed for the fast and in a particular conformational state. From these
accurate estimation of the free change upon muta- probabilities entropy can easily be derived:
tion on the stability of a protein or a protein com-
X
plex [23–26]. It uses an all-atom representation of H(X) = − iP (xi )log2 P (xi )
these macromolecules, and has been validated on a
test database of more than 1000 mutants from more
than 20 different proteins. It currently yields a cor- The method uses a rotamer database based on
relation of 0.78 with a standard deviation of 0.41 conditional statistics of dihedral angles derived from
kcal/mol. the WHAT IF data set [27]. All amino acids from
this data and their corresponding dihedral angles
Modelling and evaluation of mutations in FoldX (10◦ bin) were used to derive the following probabili-
is performed with the BuildModel command. It is ties: P (χi ), P (χi |χi−1 ) and P (χi |χi−1 , χi−2 ), except
used first to model a homologous sequence on a for χ1 (P (χ1 ) and P (χ1 |φ, ψ)). A set of n random
structural model and to optimise the side chains to rotamers can be derived from the probability distri-
fit the new sequence, and then to evaluate the effect bution thus calculated. This will allow sampling of
of a single amino acid variation. The Gibbs free en- rotamers with greater resolution than classical ro-
ergy of a protein is calculated with the Stability com- tamer libraries.
mand. The various structural parameters used in The sampling itself is performed by Monte Carlo
the classification tests (backbone clash, backbone H based sampling method with Metropolis criterion (at
bond formation , sidechain H bond formation, elec- 298K). The Metropolis criterion states that a certain
trostatics , solvation of hydrophobic residues, solva- conformational change is accepted with a probabil-
tion of polar residues, torsion clash, Van der Waals ity p that depends on the free energy change ∆∆G
contribution,Van der Waals clash) associated with the conformational change as given
1
by the following formula: the proteins in this extended data set were not pre-
viously known to contain amyloidogenic sequences
p = 1if ∆∆G < 0 such as presenilin-2, titin and myosin. Waltz com-
∆∆G bines terms from amino acid sequence scoring in the
p = e− RT if ∆∆G ≥ 0
learning set, physical property analysis and homol-
The free energy of each change is determined ogy modelling. The method shows 84% sensitivity
with FoldX. at 92% specificity on the AmylHex data set [30],
and correctly identifies mutations in human proteins
known to be associated with amyloid deposition.
Tango
The β-aggregation prediction algorithm Tango [28]
uses a statistical mechanics approach to represent Limbo
a competition between major conformational states: Limbo is a Hsp70 binding site predictor that was
the random coil and the native conformations, as built using a dual method combining sequence and
well as β-turn, α-helix and β-aggregate. Two win- structural information [31, submitted]. Experimen-
dows of variable length slide over the sequence, and tal DnaK binding data of 53 non-redundant pep-
each such window can populate these conformational tide sequences was used to generate a sequence-
states according to a Boltzmann distribution. The based position-specific scoring matrix (PSSM) based
frequency of population of each structural state for on logarithm of the odds scores. Following an
a given segment will be relative to its energy, which in silico alanine scan of the substrate peptide in
is derived from statistical and empirical parameters. the crystal structure of a DnaK-substrate complex
To predict the β-aggregating segments of a peptide, (PDBID 1DKX Zhu1996) using FoldX, a structure-
Tango calculates the partition function of the phase based PSSM that reflects the individual contribution
space involving these conformational states. In our of certain substrate residue types for DnaK bind-
analysis we have used Tango to calculate the differ- ing was generated. The Limbo DnaK binding site
ence in aggregation tendency that results from an predictor was obtained by combining the structure-
single amino acid variation. based PSSM with a normalisation factor of 0.2 with
the sequence-based PSSM. Limbo is able to correctly
predict 89% of the true positives in a tested peptide
Waltz set (high sensitivity), with a concurrent amount of
Current methods for the prediction of the sequence only 5.9% false positives for a specific score threshold
determinants of amyloidosis suffer from two major (high specificity). The robustness of the predictor
problems: overpredicting amorphous cross β aggre- was evaluated with a cross-validation test, resulting
gates and missing amylogenic sequences that are en- in a true positive rate of 72% true positives and a
riched in the polar Q and N residues, such as the false positive rate of 5.9%. The predictor was able
prion protein. The Waltz algorithm [29, submit- to identify an entire known DnaK binding site in the
ted] tackles these problems by taking into account heat-shock promoter σ 32 [32]. We have used Limbo
amyloid hexapeptides from 48 new amyloid form- to rank mutated proteins according to their DnaK
ing sequences, derived from 31 proteins. About half binding affinity.
2
Tables
Supplementary Table S1 - Types of data sets used to train and test SNP classifiers.
Origin data set Size Number References
of data set of studies
Neutral variations
Mutagenesis studies 111-3706 9 [1–9]
Orthologs 888-16682 3 [3, 9, 10]
SwissProt SNP 502-12944 6 [3, 8, 11–14]
OMIM 558 1 [15]
dbSNP 5177-21471 2 [16, 17]
Disease mutations
Mutagenesis studies 159-1750 8 [1–9]
COSMIC database 879 1 [18]
HGMD 3768-10263 1 [9]
OMIM 879-2249 5 [3, 8, 13, 15, 18]
SwissProt Disease 175-9610 9 [3, 8, 10–14, 19, 20]
Data [21] 209 1 [20]
Data [22] 185 2 [19, 20]
Supplementary Table S2 - Performance of state-of-the-art predictors on representative data sets.
The performance of a few selected tools on SwissProt disease associated mutations and SNP data are shown.
Study Method FPR FNR TPR TNR MCC Size
set
Bao et al [11] Random Forest 0.3 0.24 0.76 0.7 0.46 205
Capriotti et al [13] HybridMeth - - - - 0.46 21185
Karchin et al [14] SVM 0.2 0.19 0.81 0.8 0.61 3691
Ng & Henikoff [19] SIFT 0.19 0.31 0.69 0.81 0.50 5333
Wang & Moult [20] Stability 0.3 0.1 0.9 0.7 0.61 262
Worth et al [16] Combined 0.09 0.68 0.32 0.91 0.28 9143
Yue & Moult [9] SVM 0.15 0.26 0.74 0.85 0.59 6077
Supplementary Table S3 - Variation of the performance of SIFT on different data sets.
3
Study Dataset FPR FNR TPR TNR MCC
Bao et al [11] Test set 0.33 0.38 0.62 0.67 0.29
Saunders et al [8] Human 0.4 0.35 0.65 0.6 0.25
Ng & Henikoff [7] lac I repressor 0.22 0.43 0.57 0.78 0.36
Ng & Henikoff [7] HIV 1-protease 0.3 0.12 0.88 0.7 0.59
Ng & Henikoff [7] T4 lysozyme 0.41 0.28 0.72 0.59 0.31
Ng & Henikoff [19] SwissProt disease 0.19 0.31 0.69 0.81 0.50
Worth et al [16] SwissProt + dbSNP 0.41 0.29 0.71 0.59 0.30
Our evaluation SwissProt 0.79 0.31 0.69 0.21 -0.12
4
Figures
2 104 2 104
1.5 104 1.5 104
Number of SNPs
Number of SNPs
1 104 1 104
5000 5000
A 0
NMR X-RAY High quality X-RAY
B 0
0 20 40 60 80 100
Structure type Sequence identity
2 104 2 104
1.5 104 1.5 104
Number of SNPs
Number of SNPs
1 104 1 104
5000 5000
C 0
0 20 40 60 80 100
D 00 50 100 150 200
Alignment coverage Length of alignment
Figure S1. Structural coverage of Ensembl non synonymous SNP data. A. Number of SNPs in structures
determined by NMR and X-ray crystallography studies or models of these structures. 11% of all non
synonymous SNPs can be mapped on crystallography structures, and 7% of all SNPs can be modeled on a high-quality X-ray
structure (resolution ≤ 2.5Å). B. Number of SNPs covered by structural data versus the sequence identity
between the query sequence and the structural model. The number of SNPs that can be modeled on X-ray structures
(•) decreases from 15% of all nsSNPs (15685 nsSNPs, 5% sequence identity) to 2.5% (3341) of all SNPs for which the
structure of the wild type sequence has been determined experimentally (100% sequence identity). When only high quality
structures are considered (◦), this amount is reduced by half to 7.4% for a sequence identity of 5% and 1.5% for exact models.
C. Number of SNPs covered by structural data versus the sequence coverage of the wild type sequence. There
are almost no SNPs for which the full length of the protein sequence is covered (100% coverage), but for 80% coverage almost
8000 SNPs can be selected, of which circa 5500 in high quality structures. D. Number of SNPs covered by structural
data versus the length of the alignment between protein sequence and structural model. About a third of the
SNPs that can be modeled are located in a structural alignment that is less than 100 amino acids long, both for models based
on all X-ray structures (•) and based on high resolution structures only (◦).
5
Overall energy contribution Backbone H bond
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Side chain H bond Electrostatics
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Entropy side chain Entropy main chain
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Figure S2. ROC curves for classification of disease mutations and neutral variation by using structural properties of the
amino acid substitution site.
6
Van der Waals contribution Solvation hydrophobic
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Main chain burial Side chain burial
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Solvation polar Van der Waals clash
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Figure S2 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
properties of the amino acid substitution site.
7
Entropy side chain Entropy side chain
by sampling strategy by sampling strategy (surface)
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Entropy side chain
by sampling strategy (buried)
100
80
60
TPR
40
20
0
0 20 40 60 80 100
FPR
Figure S2 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
properties of the amino acid substitution site.
8
Overall stability difference Overall stability difference
(surface residues)
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Overall stability difference
(buried residues) Backbone clash
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Backbone H bond Sidechain H bond
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Figure S3. ROC curves for classification of disease mutations and neutral variation by using structural differences between
the wild type and variant protein.
9
Electrostatics Entropy main chain
100 100
80 80
Electrostatics Entropy main chain
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Entropy side chain Solvation hydrophobic
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Solvation polar Torsion
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
differences between the wild type and variant protein.
10
Van der Waals contribution Van der Waals clash
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Entropy side chain Entropy side chain
by sampling strategy by sampling strategy (buried)
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Entropy side chain
by sampling strategy (surface) Tango
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
differences between the wild type and variant protein.
11
Tango (positive scores) Tango (negative scores)
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Waltz Waltz (positive scores)
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Waltz (negative scores) Limbo
100 100
80 80
60 60
TPR
TPR
40 40
20 20
0 0
0 20 40 60 80 100 0 20 40 60 80 100
FPR FPR
Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
differences between the wild type and variant protein.
12
References 16. Worth CL, Bickerton GRJ, Schreyer A, Forman JR,
1. Chasman D, Adams RM: Predicting the func- Cheng TMK, Lee S, Gong S, Burke DF, Blundell TL:
tional consequences of non-synonymous single nu- A structural bioinformatics approach to the anal-
cleotide polymorphisms: Structure-based assess- ysis of nonsynonymous single nucleotide polymor-
ment of amino acid variation. J Mol Biol 2001, phisms (nsSNPs) and their relation to disease. J
307(2):683–706. Bioinform Comput Biol 2007, 5(6):1297–1318.
2. Clifford RJ, Edmonson MN, Nguyen C, Buetow KH: 17. Burke DF, Worth CL, Priego EM, Cheng T, Smink LJ,
Large-scale analysis of non-synonymous coding Todd JA, Blundell TL: Genome bioinformatic anal-
region single nucleotide polymorphisms. Bioinfor- ysis of nonsynonymous SNPs. BMC Bioinformatics
matics 2004, 20(7):1006–1014. 2007, 8:301.
3. Ferrer-Costa C, Orozco M, de la Cruz X: Sequence- 18. Worth CL, Burke DF, Blundell TL: Estimating the ef-
based prediction of pathological mutations. Pro- fects of single nucleotide polymorphisms on pro-
teins 2004, 57(4):811–819. tein structure: how good are we at identifying
4. Jiang R, Yang H, Sun F, Chen T: Searching for inter- likely disease associated mutations? In Proceedings
pretable rules for disease mutations: a simulated of Molecular Interactions - Bringing Chemistry to Life
annealing bump hunting strategy. BMC Bioinfor- 2006.
matics 2006, 7:417. 19. Ng PC, Henikoff S: Accounting for human poly-
5. Krishnan VG, Westhead DR: A comparative study of morphisms predicted to affect protein function.
machine-learning methods to predict the effects of Genome Res 2002, 12(3):436–446.
single nucleotide polymorphisms on protein func- 20. Wang Z, Moult J: SNPs, protein structure, and dis-
tion. Bioinformatics 2003, 19(17):2199–2209. ease. Hum Mutat 2001, 17(4):263–270.
6. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, West- 21. Halushka MK, Fan JB, Bentley K, Hsie L, Shen N,
head DR: Predicting the effect of missense muta- Weder A, Cooper R, Lipshutz R, Chakravarti A: Pat-
tions on protein function: analysis with Bayesian terns of single-nucleotide polymorphisms in can-
networks. BMC Bioinformatics 2006, 7:405. didate genes for blood-pressure homeostasis. Nat
7. Ng PC, Henikoff S: Predicting deleterious amino Genet 1999, 22(3):239–247.
acid substitutions. Genome Res 2001, 11(5):863–874. 22. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K,
8. Saunders CT, Baker D: Evaluation of structural and Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N,
evolutionary contributions to deleterious muta- Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington
tion prediction. J Mol Biol 2002, 322(4):891–901. J, Lipshutz R, Daley GQ, Lander ES: Characterization
of single-nucleotide polymorphisms in coding re-
9. Yue P, Li Z, Moult J: Loss of protein structure sta- gions of human genes (vol 22, pg 231, 1999). Nat
bility as a major causative factor in monogenic Genet 1999, 23(3):373–373.
disease. J Mol Biol 2005, 353(2):459–473.
23. Serrano L, Guerois R: Fold-X: An algorithm to pre-
10. Ferrer-Costa C, Orozco M, de la Cruz X: Characteriza- dict and engineer folding pathways. Abstr Pap Am
tion of disease-associated single amino acid poly- Chem Soc 2001, 221:U395–U395.
morphisms in terms of sequence and structure
properties. J Mol Biol 2002, 315(4):771–786. 24. Guerois R, Nielsen JE, Serrano L: Predicting changes
in the stability of proteins and protein complexes:
11. Bao L, Cui Y: Prediction of the phenotypic ef-
A study of more than 1000 mutations. J Mol Biol
fects of non-synonymous single nucleotide poly-
2002, 320(2):369–387.
morphisms using structural and evolutionary in-
formation. Bioinformatics 2005, 21(10):2185–2190. 25. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F,
Serrano L: The FoldX web server: an online force
12. Bao L, Cui Y: Functional impacts of non-
field. Nucleic Acid Res 2005, 33:W382–W388.
synonymous single nucleotide polymorphisms:
Selective constraint and structural environments. 26. Schymkowitz JWH, Rousseau F, Martins IC, Ferkinghoff-
FEBS Lett 2006, 580(5):1231–4. Borg J, Stricher F, Serrano L: Prediction of water and
metal binding sites and their affinities by using
13. Capriotti E, Calabrese R, Casadio R: Predicting the in-
the Fold-X force field. Proc Natl Acad Sci USA 2005,
surgence of human genetic diseases associated to
102(29):10147–10152.
single point protein mutations with support vec-
tor machines and evolutionary information. Bioin- 27. Vriend G: What If - a molecular modeling and drug
formatics 2006, 22(22):2729–2734. design program. J Mol Graph 1990, 8:52–.
14. Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper 28. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J,
U, Eswar N, Haussler D, Sali A: LS-SNP: large- Serrano L: Prediction of sequence-dependent and
scale annotation of coding non-synonymous SNPs mutational effects on the aggregation of peptides
based on multiple information sources. Bioinfor- and proteins. Nat Biotechnol 2004, 22(10):1302–1306.
matics 2005, 21(12):2814–2820. 29. Maurer-Stroh S, Kuemmerer N, Lopez de la Paz M, Mar-
15. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif tins I, Reumers J, Serrano L, Rousseau F, Schymkowitz
S, Liang J: Structural location of disease-associated J: Accurate prediction of sequence determinants
single-nucleotide polymorphisms. J Mol Biol 2003, of amyloid formation using the Waltz algorithm.
327(5):1021–1030. Submitted 2007.
13
30. Thompson MJ, Sievers SA, Karanicolas J, Ivanova MI, a method that integrates homology modelling and
Baker D, Eisenberg D: The 3D profile method for experimental data. Submitted 2007.
identifying fibril-forming segments of proteins.
32. McCarty JS, Rudiger S, Schonfeld HJ, Schneider-
Proc Natl Acad Sci U S A 2006, 103(11):4074–4078.
Mergener J, Nakahigashi K, Yura T, Bukau B: Regula-
tory region C of the E. coli heat shock transcrip-
31. Van Durme J, Maurer-Stroh S, Wilkinson H, Rousseau tion factor, sigma32, constitutes a DnaK binding
F, Schymkowitz J: Accurate prediction of the se- site and is conserved among eubacteria. J Mol Biol
quence determinants of DnaK-peptide binding via 1996, 256(5):829–37.
14