Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations Joke Reumers1 , Joost Schymkowitz1 and Fréderic Rousseau∗1 1 Switch Laboratory, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium Email: Joke Reumers - joke.reumers@vub.ac.be; Joost Schymkowitz - joost.schymkowitz@vub.ac.be; Fréderic Rousseau∗ - frederic.rousseau@vub.ac.be; ∗ Corresponding author Abstract Background: Linking structural effects of mutations to functional outcomes is a major issue in structural bioin- formatics, and many tools and studies have shown that specific structural properties such as stability and residue burial can be used to distinguish neutral variations and disease associated mutations. Results: We have investigated 39 structural properties on a set of SNPs and disease mutations from the Uniprot Knowledge Base that could be mapped on high quality crystal structures and show that none of these properties can be used as a sole classification criterion to separate the two data sets. Furthermore, we have reviewed the annotation process from mutation to result and identified the liabilities in each step. Conclusions: Although excellent annotation results of various research groups underline the great potential of using structural bioinformatics to investigate the mechanisms underlying disease, the interpretation of such annotations cannot always be extrapolated to proteome wide variation studies. Difficulties for large-scale studies can be found both on the technical level, i.e. the scarcity of data and the incompleteness of the structural tool suites, and on the conceptual level, i.e. the correct interpretation of the results in a cellular context. Background plete molecular phenotype may seem naive at first glance, had it not been suggested that individual The molecular phenotype of a coding non synony- properties such as protein stability, the accessibility mous SNP or disease associated mutation describes of the amino acid substitution site, and the location the functional and structural properties of a protein of variants in surface pockets are predictive deter- that are affected by a single amino acid substitu- minants of the phenotypic effect of a variation [1–4]. tion [25]. In this study we want to address whether A comparative study of protein stability predictors the concept of the in silico determined molecular by Blundell and co-workers demonstrated that al- phenotype can be employed for large-scale classifica- though protein stability changes caused by mutation tion of SNPs and disease mutations. The attempt to can be relatively accurately estimated in silico, these classify a large set of mutations based on an incom- 1 predictions by themselves do not yield accuracy on structure quality are applied. Our standard restric- large-scale classification between benign and disrup- tions on building high-confidence structural models tive mutations [5–7]. using the FoldX force field are X-ray structures with Furthermore, computational analyses rely heav- a resolution lower than 2.5 Å and sequence identity ily on the quality of the data under scrutiny and higher than 80%. Applying these restrictions to the the computational methods used to evaluate these Ensembl data results in a data set of 5416 nsSNPs data. Before investigating 39 structural properties of (circa 4% of the data, Figure S1B). proteins and amino acid substitutions for their pre- dictive power regarding SNP classification, we have investigated what major liabilities are encountered Predictability of structural properties when implementing an structural approach to SNP The second issue for a large-scale structural bioinfor- annotation and classification. The results are com- matics approach is the structural properties that are pared with those achieved by the best performers predictable with state of the art tools: how well can among the state-of-the-art tools. we describe the structural behaviour of a protein and its mutants? Previous structural studies have iden- tified protein stability, aggregation and misfolding as determinants of correct functioning on the single Results and Discussion protein level [7,11,12]. Mutations affecting the func- In this study we have identified the common issues tional sites of a protein, such as DNA, ligand and that are encountered when performing large-scale protein interaction sites, are not considered within analyses of structural properties of human coding this scope, but the investigation of these sites will variation. The first issue concerns the availability of most certainly be of great importance to assess the structural data for nsSNPs and disease mutations, impact of amino acid substitutions. while the second involves the availability of compu- Tools have been developed that describe the tational tools to predict structural properties. The structure and dynamics of a protein: stability, ag- last issue concerns the quality of classification: are gregation, amyloidosis, and folding. We have used the training and evaluation data sets used in the computational methods that are capable of assessing analyses sufficient to extrapolate results for larger the effects of a mutation on protein stability (FoldX), studies, and do the properties used have sufficient aggregation (Tango) and amyloidosis (Waltz). Al- predictive power to separate the two data sets? though algorithms exist that can predict folding of small single domain proteins (e.g. Rosetta [13], FoldX [14], SimFold [15]), to date no computational Structural coverage of human genetic variation method exists that can predict folding events on Despite structural genomics projects, the gap be- large multi-domain proteins, or that is applicable in tween sequence and structural information is still genome wide studies. wide, and the coverage of variation data with struc- Although we have not investigated protein- tural data is estimated to be as low as 14% [4]. We protein interactions in this study, we have included have investigated the boundaries of structural cov- an analysis of the binding of proteins to molecular erage by varying the quality requirements on the chaperones, as it is directly related to correct folding structural model (Supplementary Figure S1A), the of the protein. The high abundance of chaperones in sequence identity between query sequence and mod- the cell emphasises their crucial role in the cell [16], elled structure (Figure S1B), the percentage of the but this is not reflected in the availability of compu- wild type sequence covered by the structural model tational tools for chaperone binding. We have used (Figure S1C), and the length of the alignment be- the only available tool, the Hsp70 binding predic- tween query and target (Figure S1D). Without ap- tor Limbo [17], to assess chaperone binding variation plying any restrictions, about 12% of all nsSNPs caused by amino acid alteration. present in the Ensembl Variation Database (release 44) can be mapped on a structural model, in accor- dance with the estimate cited previously. However, The predictive power of structural properties this percentage is valid only when no restrictions Following the recommendations of Care et al [18], regarding sequence identity, sequence coverage or we have used the SwissProt annotated disease and 2 polymorphism data (SwissProt Variation Index re- ied between the minimal and maximal values mea- lease 52) as the evaluation data for our analyses. sure for the specific property, and the true and false Mapping of these variants on high quality structural positive rate, and the Matthews correlation coeffi- models (X-ray structures with resolution ≤ 2.5Å, se- cient (MCC) were calculated for each cut-off value. quence identity with the model above 80%) yielded Table 3 lists the data for both the best MCC and a data set of 240 positive (disease-associated) muta- the MCC90, i.e. the coefficient that is measured at tions and 400 negative variations (neutral nsSNPs) high specificity (true negative rate = 90%). The in 98 proteins. To ensure that the analyses are com- corresponding ROC curves for these analyses can be parable, we applied the sequence based predictors to found in Supplementary Figure S1. the same small data set as the predictors that use The same strategy was then applied to predicted 3D structures or structural models. values of structural differences between mutant and Before we evaluated the discriminative power of wild type proteins (24 properties). Statistics were the individual structural parameters, we wanted to calculated for stability and entropy parameters, as assess whether our data showed distinguishable pat- well as for differences concerning protein aggrega- terns for three important parameters. The first two tion, amyloidosis and chaperone binding (Table 4, criteria, stability difference and the degree of burial Supplementary Figure S2). of the mutation site, have previously been identi- The results obtained from these detailed analy- fied as providing information about the severity of ses are unanimous: none of the parameters evaluated a mutation [4, 19]. The third criterion is difference can be used to separate the data. All MCC values in aggregation propensity, which has been cited as are close to zero, and thus the predictions are no bet- likely to be an important factor in disease suscepti- ter than a random predictor would perform on the bility [12, 20] but thus far has not been applied in a data. The high accuracy of FoldX for stability esti- proteome wide mutation analysis. mation has been proven in various studies [6,9,10], so Figure 1 shows the distributions for the stabil- we have high confidence in our stability estimations. ity differences (A) and differences in aggregation In accordance with the analyses of [7], we find that propensity (B) between wild type and variant pro- high stability differences alone are no sufficient crite- teins, and the burial of the mutation site (C). The rion to distinguish deleterious mutations and neutral first observation of both the stability and the ag- variation. These results show that the dominant ef- gregation analysis is that the observed changes are fect of for instance stability that was proposed in not discrete but follow a smooth distribution from earlier large-scale studies [4, 22] can not be always negative to positive change. Second, there are no- generalised for other data. ticeable differences between SNPs and disease muta- The fact that none of the properties representing tions, but they cannot be distinguished by a simple conformational differences between wild type and cut-off value on the output, as there is large over- variant protein contain enough information to distin- lap between the distributions. This is confirmed by guish neutral and deleterious variation implies that the P-values obtained from paired student t-tests, large-scale classification based on singular structural which are 0.96 for the stability distributions, 0.99 properties is not feasible and requires a better un- for the aggregation distributions, and 0.99 for the derstanding of how the complex interplay between burial distributions, respectively. For the stability biophysical and biochemical properties of a protein distributions, we see that disease mutations are gen- conspire to different tolerance for mutations in dif- erally more destabilising than SNPs, but their distri- ferent proteins. butions overlap largely. A similar analysis has been Recent studies that combine structural and evo- performed on SwissProt variants using the Site Di- lutionary information using machine learning tech- rected Mutator stability predictor [7], and the distri- niques are able to classify relatively large data sets butions of stability differences of disease mutations obtained for the SwissProt database successfully and neutral variations are similar to our findings. (summarised in Table S2). Machine learning ap- In a first series of properties to test as classifiers, proaches suggest that data integration is indeed the we have investigated 15 properties of the amino acid way forward, but the creation of this black box style substitution site that contribute to the assessment of of classifier does not offer insight into the biological the effect of the mutation using the FoldX algorithm processes. In the same way that using evolutionary (Table 3). Cut off values were generated that var- information to classify SNPs obscures the how and 3 why a specific mutation is deleterious, using black tural bioinformatics tools that were proposed in the box machine learning methods will not teach us what SNPeffect toolsuite [26] for their ability to act as a the underlying reason of disease is. Although know- binary classifier for deleterious and neutral SNPs. ing that an amino acid is critical for correct function Neither of the individual properties that were ex- is of course useful, in a structural bioinformatics ap- amined could serve this purpose. Because several proach the focus is more on the molecular mecha- approaches were able to classify similar data sets as nism underlying disease. the one we have used, we applied the most used evo- A simple combination of the SNPeffect structural lutionary method, SIFT [23], to our data set. As it bioinformatics toolsuite on our evaluation data set was not able to classify our data set accurately, we showed that in our case, at least a linear combina- argued that generalisation of the results presented by tion of these methods is not sufficient to classify the the state of the art classifiers might be an important data (TPR = 0.73, TNR= 0.27, MCC=0). A large issue. We illustrated this problem with the variabil- part of the polymorphism data is predicted to have ity of performance of SIFT on 8 different data sets deleterious effect. To assess the “predictiveness” of used in various analyses. our data set, we applied the well-established evolu- tionary method SIFT [24] to our data and found that SIFT was also not able to classify effectively. In fact the results were even worse than our naive classifier From these analyses we concluded that strict (TPR=0.69, TNR=0.21, MCC=-0.12). classification of SNPs is not feasible at the time, both As an illustration of the influence of the data because there are still many technical difficulties to set used for evaluation on the performance of a pre- overcome, and because the biological interpretation dictor, we list the results for the variation in per- of the molecular phenotype in relation to a disease formance of SNP classification of SIFT, that uses phenotype is a complex matter. Even at the single evolutionary information to label SNPs (Supplemen- molecule level, we cannot assess how tolerant a spe- tary Table S3). The Matthews correlation coefficient cific protein is to structural variation. The inherent varies between -0.12 on our data set over 0.25 on hu- rigidity of a protein might influence the change in man mutagenesis data, up to 0.59 on the HIV-1 pro- stability that is allowed before severe conformational tease mutagenesis set in the original SIFT paper [24]. changes are introduced. Furthermore, on the cellu- This is yet another informative example on how cru- lar level biological interpretation is even harder: we cial the choice of training and test data are to build can not predict the role of the protein quality control and evaluate predictors: generalisation of results is system plays in this tolerance level, not all interac- only possible when the training data are expressive tions are described at the molecular level, and much enough to represent the entire feature space. more. Even if we can predict the molecular effect accurately, this might not necessarily result in a dis- ease phenotype because of functional redundancy of the protein. Conclusions The concept of using the molecular phenotypic ef- fect of a nsSNP to assess its effect on the structure and function of the protein it alters was first intro- However, not being able to classify human varia- duced by Bork and co-workers [25]. The question tion into disease mutations and neutral or beneficial has been raised to how much of this molecular phe- variation does not mean that this approach or the notype is necessary to evaluate the contribution of methods developed are useless. By using high qual- a SNP to a disease phenotype: are there singular ity bioinformatics tools, we can select from a large dominant properties that determine the impairment pool of variations the candidates that are interesting of structure and function, or do we need to consider for detailed investigation. This in itself is a valuable the full ensemble of molecular properties to interpret contribution, because the amount of variation data the impact of the SNP? Other research groups have available is too massive to be investigated experi- proposed that single properties such as stability [4] mentally. In silico analyses can and will be used and solvent accessibility [1] can be used to classify successfully as an addition to in vitro and in vivo SNPs. We have examined all the individual struc- studies. 4 Methods References Assembly of data sets 1. Chasman D, Adams RM: Predicting the func- tional consequences of non-synonymous single nu- Statistics on the structural coverage and validation cleotide polymorphisms: Structure-based assess- status of human non synonymous coding SNPs were ment of amino acid variation. J Mol Biol 2001, performed on data from the Ensembl human vari- 307(2):683–706. ation database release 44, containing 12.2 million 2. Ferrer-Costa C, Orozco M, de la Cruz X: Characteriza- SNPs, of which 133698 cause an amino acid vari- tion of disease-associated single amino acid poly- morphisms in terms of sequence and structure ation in a known transcript. The mapping of SNPs properties. J Mol Biol 2002, 315(4):771–786. on protein structures was evaluated using the “ensp- 3. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif pdbmapping” DAS service provided by the SPICE S, Liang J: Structural location of disease-associated server [27]. Positive and negative data sets for single-nucleotide polymorphisms. J Mol Biol 2003, the evaluation of SNP classification were designed 327(5):1021–1030. with data from the SwissProt variation index [28] in 4. Yue P, Li Z, Moult J: Loss of protein structure sta- the UniProt knowledge base (version 52.0, March bility as a major causative factor in monogenic disease. J Mol Biol 2005, 353(2):459–473. 2007, [29]) that were mapped onto known PDB 5. Worth CL, Burke DF, Blundell TL: Estimating the ef- structures and high quality homologs thereof. The fects of single nucleotide polymorphisms on pro- quality criteria described in the results section (mod- tein structure: how good are we at identifying els with resolution of 3 Åor higher, sequence iden- likely disease associated mutations? In Proceedings tity of 80% or more) lead to structural models of of Molecular Interactions - Bringing Chemistry to Life 2006. 400 SNPs (negative) and 240 disease associated mu- tations (positive). 6. Burke DF, Worth CL, Priego EM, Cheng T, Smink LJ, Todd JA, Blundell TL: Genome bioinformatic anal- ysis of nonsynonymous SNPs. BMC Bioinformatics 2007, 8:301. Structural bioinformatics tools 7. Worth CL, Bickerton GRJ, Schreyer A, Forman JR, We have used the FoldX force field [33] for all mu- Cheng TMK, Lee S, Gong S, Burke DF, Blundell TL: A structural bioinformatics approach to the anal- tant properties regarding structural location, protein ysis of nonsynonymous single nucleotide polymor- stability and its various components, the Tango [34] phisms (nsSNPs) and their relation to disease. J and Waltz [35, submitted] algorithms to assess the Bioinform Comput Biol 2007, 5(6):1297–1318. propensity for aggregation of wild type and variant 8. Boutselakis H, Dimitropoulos D, Fillon J, Golovin A, proteins, and the Limbo algorithm [17, submitted] to Henrick K, Hussain A, Ionides J, John M, Keller PA, Krissinel E, McNeil P, Naim A, Newman R, Oldfield T, evaluate the chaperone-binding properties of amino Pineda J, Rachedi A, Copeland J, Sitnov A, Sobhany acid sequences. A novel tool developed by Lenaerts S, Suarez-Uruena A, Swaminathan J, Tagari M, Tate J, et al (unpublished) was used to estimate the en- Tromm S, Velankar S, Vranken W: E-MSD: the Eu- tropy of a specific amino acid site in a high-resolution ropean Bioinformatics Institute Macromolecular Structure Database. Nucleic Acids Res 2003, 31:458– structure. Detailed descriptions of these five tools 462. can be found in the Supplementary Material. 9. Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations. J Mol Biol 2002, 320(2):369–387. Authors contributions 10. Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik Conceived and designed the experiments: JR JS FR. DS: The stability effects of protein mutations ap- Performed the experiments: JR. Analysed the data: pear to be universally distributed. J Mol Biol 2007, JR JS FR. Wrote the paper: JR. 369(5):1318–1332. 11. Steward RE, MacArthur MW, Laskowski RA, Thornton JM: Molecular basis of inherited diseases: a struc- tural perspective. Trends Genet 2003, 19(9):505–513. Acknowledgements 12. DePristo M, Weinreich D, Hartl D: Missense meander- Joke Reumers was supported by a grant from the Federal ings in sequence space: A biophysical view of pro- Research Office (FWO, IUAP P6/43), Belgium, and the tein evolution. Nature Reviews Genetics 2005, AOP. Institute for the encouragement of Scientific Research 13. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio and Innovation of Brussels (ISRIB), Belgium. protein structure prediction of CASP III targets using ROSETTA. Proteins 1999, Suppl 3:171–176. 5 14. Serrano L, Guerois R: Fold-X: An algorithm to pre- 26. Reumers J, Conde L, Medina I, Maurer-Stroh S, dict and engineer folding pathways. Abstr Pap Am Van Durme J, Dopazo J, Rousseau F, Schymkowitz J: Chem Soc 2001, 221:U395–U395. Joint annotation of coding and non-coding single 15. Fujitsuka Y, Chikenji G, Takada S: SimFold energy nucleotide polymorphisms and mutations in the function for de novo protein structure prediction: SNPeffect and PupaSuite databases. Nucleic Acids consensus with Rosetta. Proteins 2006, 62(2):381– Res 2008, 36(Database issue):D825–9. 398. 27. Prlic A, Down TA, Hubbard TJ: Adding some SPICE 16. Soti C, Csermely P: Protein stress and stress pro- to DAS. Bioinformatics 2005, 21 Suppl 2:ii40–1. teins: implications in aging and disease. J Biosci 2007, 32(3):511–515. 28. Yip YL, Famiglietti M, Gos A, Duek PD, David FPA, 17. Van Durme J, Maurer-Stroh S, Wilkinson H, Rousseau Gateau A, Bairoch A: Annotating single amino acid F, Schymkowitz J: Accurate prediction of the se- polymorphisms in the UniProt/Swiss-Prot knowl- quence determinants of DnaK-peptide binding via edgebase. Hum Mutat 2008, 29(3):361–366. a method that integrates homology modelling and 29. UniProt Consortium: The Universal Protein experimental data. Submitted 2007. Resource (UniProt). Nucleic Acids Res 2007, 18. Care MA, Needham CJ, Bulpitt AJ, Westhead DR: Dele- 35(Database issue):D193–7. terious SNP prediction: be mindful of your train- ing data! Bioinformatics 2007, 23(6):664–672. 30. Zweig MH, Campbell G: Receiver-operating charac- 19. Ramensky V, Bork P, Sunyaev S: Human non- teristic (ROC) plots: a fundamental evaluation synonymous SNPs: server and survey. Nucleic Acid tool in clinical medicine. Clin Chem 1993, 39(4):561– Res 2002, 30(17):3894–3900. 577. 20. Worth CL, Blundell TL: Estimating the effects of 31. Matthews BW: Comparison of the predicted SNPs on protein structure: loss of protein inter- and observed secondary structure of T4 phage actions and stability as indicators of mis-function lysozyme. Biochim Biophys Acta 1975, 405(2):442–451. and disease-association. Curr Top Biochem Res 2008, In press. 32. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: 21. Stitziel NO, Binkowski TA, Tseng YY, Kasif S, Liang Assessing the accuracy of prediction algorithms J: TopoSNP: a topographic database of non- for classification: an overview. Bioinformatics 2000, synonymous single nucleotide polymorphisms 16(5):412–424. with and without known disease association. Nu- 33. Schymkowitz JWH, Rousseau F, Martins IC, Ferkinghoff- cleic Acid Res 2004, 32:D520–D522. Borg J, Stricher F, Serrano L: Prediction of water and 22. Yue P, Melamud E, Moult J: SNPs3D: candidate gene metal binding sites and their affinities by using and SNP selection for association studies. BMC the Fold-X force field. Proc Natl Acad Sci USA 2005, Bioinformatics 2006, 7:166. 102(29):10147–10152. 23. Ng PC, Henikoff S: SIFT: predicting amino acid changes that affect protein function. Nucleic Acid 34. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Res 2003, 31(13):3812–3814. Serrano L: Prediction of sequence-dependent and mutational effects on the aggregation of peptides 24. Ng PC, Henikoff S: Predicting deleterious amino and proteins. Nat Biotechnol 2004, 22(10):1302–1306. acid substitutions. Genome Res 2001, 11(5):863–874. 25. Sunyaev S, Lathe Wr, Bork P: Integration of genome 35. Maurer-Stroh S, Kuemmerer N, Lopez de la Paz M, Mar- data and protein structures: prediction of protein tins I, Reumers J, Serrano L, Rousseau F, Schymkowitz folds, protein interactions and “molecular pheno- J: Accurate prediction of sequence determinants types” of single nucleotide polymorphisms. Curr of amyloid formation using the Waltz algorithm. Opin Struct Biol 2001, 11:125–130. Submitted 2007. Figures Figure 1 - Distributions for the major structural criteria in the disease and polymorphism datasets. White = disease mutations, grey = polymorphisms. A. Stability difference as calculated by the FoldX force field (in kcal.mol−1 ). B. Difference in aggregation propensity as calculated by the Tango algorithm. Values close to neutral changes (in the range [−50, 50]) are left out for display purposes. C. Distribution of degree of burial of the amino acid substitution site. 6 Tables Table 1 - Summary of structural coverage of SNP data. Several criteria resulting from the above analyses are applied to assess the structural coverage and reliability of that coverage of human SNPs in the Ensembl database, as well as the overlap of the structural coverage with quality parameters for the validation and frequency status of the polymorphism data. Properties # SNPs % SNPs nsSNPs covered by high quality structural data No additional criteria 9877 7.4 Sequence coverage>80 or alignment length> 100 8238 6.2 Sequence identity>80 5416 4.1 Sequence coverage>80 or alignment length> 100, 5318 4.0 and sequence identity>80 Highly reliable nsSNPs covered by high quality structural data Doublehit validation status, MAF>0.01 680 0.51 Doublehit validation status, MAF>0.01, sequence 229 0.17 identity>80 Doublehit validation status, MAF>0.01, sequence 446 0.33 coverage>80 or alignment length> 100 Doublehit validation status, MAF>0.01, sequence 209 0.16 coverage>80 or alignment length> 100, and se- quence identity>80 Table 2 - Predictive power of structural properties of the modeled variant proteins. FoldX was used to evaluate both the overall stability contribution of the amino acid substitution site in the modeled structure and the various factors involved in this stability. The entropy of the variant amino acid was calculated using a sampling strategy to assess the possible side chain conformations allowed at the substitution site. Both stability and entropy were calculated for all mutations and for a subset of buried mutations (side chain burial < 0.5) and surface mutations (side chain burial ≥ 0.5). Corresponding ROC curves are shown in Supplementary Figure S2. Table 1 Property FPR TPR Best MCC Threshold MCC90 FoldX energy evaluation Overall stability of residue 14 33 0.22 1.61 0.19 Backbone H bond 32 72 0.40 -1.05 0.22 Sidechain H bond 99 100 0.07 -1.76 <0 Electrostatics 86 93 0.11 -0.10 -0.01 Entropy side chain 59 80 0.22 0.32 0.05 Entropy main chain 13 27 0.18 1.96 0.10 Van der Waals contribution 25 47 0.23 -0.98 0.15 Solvation hydrophobic 10 22 0.16 -0.6 0.16 Solvation polar 42 70 0.28 1.5 0.06 Van der Waals clash 18 33 0.17 0.22 0.15 Side chain burial 51 67 0.16 0.43 -0.1 Main chain burial 59 83 0.26 0.73 0.05 Entropy by sampling of possible side chain conformations Entropy side chain 72 84 0.15 0.93 0 7 Table 3 - Predictive power of the differences between wild type and variant proteins for different structural properties. FoldX was used to evaluate both the overall stability difference between wild type and variant structure, and the constituting contributions leading to this stability difference. The entropy difference caused by the amino acid substitution was calculated using a sampling strategy to assess the possible side chain conformations allowed at the substitution site. Both stability and entropy difference were calculated for all mutations and for a subset of buried mutations (side chain burial < 0.5) and surface mutations (side chain burial ≥ 0.5). Corresponding ROC curves are shown in Supplementary Figure S3. Property FPR TPR Best Threshold MCC90 MCC FoldX energy evaluation Overall stability difference 73 85 0.15 -0.45 0.14 Overall stability diff. (surface) 0 8 0.2 3.1 0.13 Overall stability diff. (buried) 21 44 0.25 2.64 0.12 Backbone clash 91 99 0.18 -1.00 -0.02 Backbone H bond 59 83 0.26 -0.025 0.06 Sidechain H bond 79 92 0.18 -0.13 -0.14 Electrostatics 6 18 0.18 0.15 0.16 Entropy main chain 6 18 0.18 0.15 0.04 Entropy side chain 64 74 0.11 -0.125 -0.05 Solvation hydrophobic 57 75 0.19 -0.15 -0.03 Solvation polar 22 36 0.15 0.20 -0.05 Torsion clash 1 3 0.07 1.00 -0.05 Van der Waals contribution 7 14 0.11 0.89 0.10 Van der Waals clash 98 100 0.10 -1.60 0.02 Entropy difference by sampling of possible side chain conformations FoldX entropy difference 85 92 0.11 -1.85 -0.02 FoldX entropy diff. (buried) 96 100 0.14 -2.70 -0.05 FoldX entropy diff. (surface) 37 57 0.20 -0.10 0.02 Aggregation properties Tango 1 3 0.07 39.9 0 Tango (positive, more aggr.) 14 22 0.10 16.37 0 Tango (negative, less aggr.) 69 78 0.10 -8.00 0 Waltz 0 1 0.07 748.97 0 Waltz (positive, more aggr.) 16 21 0.06 677.15 0 Waltz (negative, less aggr.) 99 100 0.07 -2412.78 0 Limbo 17 33 0.18 5.45 0 Additional Files Figure 1 – figure1.pdf Additional file 2 — supplementary.pdf Several of the less critical figures and tables are added as supplementary material, together with detailed descriptions of the structural bioinformatics tools used. 8 Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations Supplementary Material Joke Reumers1 , Joost Schymkowitz1 and Fréderic Rousseau∗1 1 Switch Laboratory, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium Email: Joke Reumers - joke.reumers@vub.ac.be; Joost Schymkowitz - joost.schymkowitz@vub.ac.be; Fréderic Rousseau∗ - frederic.rousseau@vub.ac.be; ∗ Corresponding author Methods Entropy calculations based on side chain sampling In addition to the entropy calculations intrinsic to Structural bioinformatics tools the FoldX force field, we use a novel method based on extensive sampling of side chain conformations as developed by Lenaerts et al. (unpublished). The FoldX sampling method produces for each side chain the probability (P (X)) of finding the residue’s side chain The FoldX force field was developed for the fast and in a particular conformational state. From these accurate estimation of the free change upon muta- probabilities entropy can easily be derived: tion on the stability of a protein or a protein com- X plex [23–26]. It uses an all-atom representation of H(X) = − iP (xi )log2 P (xi ) these macromolecules, and has been validated on a test database of more than 1000 mutants from more than 20 different proteins. It currently yields a cor- The method uses a rotamer database based on relation of 0.78 with a standard deviation of 0.41 conditional statistics of dihedral angles derived from kcal/mol. the WHAT IF data set [27]. All amino acids from this data and their corresponding dihedral angles Modelling and evaluation of mutations in FoldX (10◦ bin) were used to derive the following probabili- is performed with the BuildModel command. It is ties: P (χi ), P (χi |χi−1 ) and P (χi |χi−1 , χi−2 ), except used first to model a homologous sequence on a for χ1 (P (χ1 ) and P (χ1 |φ, ψ)). A set of n random structural model and to optimise the side chains to rotamers can be derived from the probability distri- fit the new sequence, and then to evaluate the effect bution thus calculated. This will allow sampling of of a single amino acid variation. The Gibbs free en- rotamers with greater resolution than classical ro- ergy of a protein is calculated with the Stability com- tamer libraries. mand. The various structural parameters used in The sampling itself is performed by Monte Carlo the classification tests (backbone clash, backbone H based sampling method with Metropolis criterion (at bond formation , sidechain H bond formation, elec- 298K). The Metropolis criterion states that a certain trostatics , solvation of hydrophobic residues, solva- conformational change is accepted with a probabil- tion of polar residues, torsion clash, Van der Waals ity p that depends on the free energy change ∆∆G contribution,Van der Waals clash) associated with the conformational change as given 1 by the following formula: the proteins in this extended data set were not pre- viously known to contain amyloidogenic sequences p = 1if ∆∆G < 0 such as presenilin-2, titin and myosin. Waltz com- ∆∆G bines terms from amino acid sequence scoring in the p = e− RT if ∆∆G ≥ 0 learning set, physical property analysis and homol- The free energy of each change is determined ogy modelling. The method shows 84% sensitivity with FoldX. at 92% specificity on the AmylHex data set [30], and correctly identifies mutations in human proteins known to be associated with amyloid deposition. Tango The β-aggregation prediction algorithm Tango [28] uses a statistical mechanics approach to represent Limbo a competition between major conformational states: Limbo is a Hsp70 binding site predictor that was the random coil and the native conformations, as built using a dual method combining sequence and well as β-turn, α-helix and β-aggregate. Two win- structural information [31, submitted]. Experimen- dows of variable length slide over the sequence, and tal DnaK binding data of 53 non-redundant pep- each such window can populate these conformational tide sequences was used to generate a sequence- states according to a Boltzmann distribution. The based position-specific scoring matrix (PSSM) based frequency of population of each structural state for on logarithm of the odds scores. Following an a given segment will be relative to its energy, which in silico alanine scan of the substrate peptide in is derived from statistical and empirical parameters. the crystal structure of a DnaK-substrate complex To predict the β-aggregating segments of a peptide, (PDBID 1DKX Zhu1996) using FoldX, a structure- Tango calculates the partition function of the phase based PSSM that reflects the individual contribution space involving these conformational states. In our of certain substrate residue types for DnaK bind- analysis we have used Tango to calculate the differ- ing was generated. The Limbo DnaK binding site ence in aggregation tendency that results from an predictor was obtained by combining the structure- single amino acid variation. based PSSM with a normalisation factor of 0.2 with the sequence-based PSSM. Limbo is able to correctly predict 89% of the true positives in a tested peptide Waltz set (high sensitivity), with a concurrent amount of Current methods for the prediction of the sequence only 5.9% false positives for a specific score threshold determinants of amyloidosis suffer from two major (high specificity). The robustness of the predictor problems: overpredicting amorphous cross β aggre- was evaluated with a cross-validation test, resulting gates and missing amylogenic sequences that are en- in a true positive rate of 72% true positives and a riched in the polar Q and N residues, such as the false positive rate of 5.9%. The predictor was able prion protein. The Waltz algorithm [29, submit- to identify an entire known DnaK binding site in the ted] tackles these problems by taking into account heat-shock promoter σ 32 [32]. We have used Limbo amyloid hexapeptides from 48 new amyloid form- to rank mutated proteins according to their DnaK ing sequences, derived from 31 proteins. About half binding affinity. 2 Tables Supplementary Table S1 - Types of data sets used to train and test SNP classifiers. Origin data set Size Number References of data set of studies Neutral variations Mutagenesis studies 111-3706 9 [1–9] Orthologs 888-16682 3 [3, 9, 10] SwissProt SNP 502-12944 6 [3, 8, 11–14] OMIM 558 1 [15] dbSNP 5177-21471 2 [16, 17] Disease mutations Mutagenesis studies 159-1750 8 [1–9] COSMIC database 879 1 [18] HGMD 3768-10263 1 [9] OMIM 879-2249 5 [3, 8, 13, 15, 18] SwissProt Disease 175-9610 9 [3, 8, 10–14, 19, 20] Data [21] 209 1 [20] Data [22] 185 2 [19, 20] Supplementary Table S2 - Performance of state-of-the-art predictors on representative data sets. The performance of a few selected tools on SwissProt disease associated mutations and SNP data are shown. Study Method FPR FNR TPR TNR MCC Size set Bao et al [11] Random Forest 0.3 0.24 0.76 0.7 0.46 205 Capriotti et al [13] HybridMeth - - - - 0.46 21185 Karchin et al [14] SVM 0.2 0.19 0.81 0.8 0.61 3691 Ng & Henikoff [19] SIFT 0.19 0.31 0.69 0.81 0.50 5333 Wang & Moult [20] Stability 0.3 0.1 0.9 0.7 0.61 262 Worth et al [16] Combined 0.09 0.68 0.32 0.91 0.28 9143 Yue & Moult [9] SVM 0.15 0.26 0.74 0.85 0.59 6077 Supplementary Table S3 - Variation of the performance of SIFT on different data sets. 3 Study Dataset FPR FNR TPR TNR MCC Bao et al [11] Test set 0.33 0.38 0.62 0.67 0.29 Saunders et al [8] Human 0.4 0.35 0.65 0.6 0.25 Ng & Henikoff [7] lac I repressor 0.22 0.43 0.57 0.78 0.36 Ng & Henikoff [7] HIV 1-protease 0.3 0.12 0.88 0.7 0.59 Ng & Henikoff [7] T4 lysozyme 0.41 0.28 0.72 0.59 0.31 Ng & Henikoff [19] SwissProt disease 0.19 0.31 0.69 0.81 0.50 Worth et al [16] SwissProt + dbSNP 0.41 0.29 0.71 0.59 0.30 Our evaluation SwissProt 0.79 0.31 0.69 0.21 -0.12 4 Figures 2 104 2 104 1.5 104 1.5 104 Number of SNPs Number of SNPs 1 104 1 104 5000 5000 A 0 NMR X-RAY High quality X-RAY B 0 0 20 40 60 80 100 Structure type Sequence identity 2 104 2 104 1.5 104 1.5 104 Number of SNPs Number of SNPs 1 104 1 104 5000 5000 C 0 0 20 40 60 80 100 D 00 50 100 150 200 Alignment coverage Length of alignment Figure S1. Structural coverage of Ensembl non synonymous SNP data. A. Number of SNPs in structures determined by NMR and X-ray crystallography studies or models of these structures. 11% of all non synonymous SNPs can be mapped on crystallography structures, and 7% of all SNPs can be modeled on a high-quality X-ray structure (resolution ≤ 2.5Å). B. Number of SNPs covered by structural data versus the sequence identity between the query sequence and the structural model. The number of SNPs that can be modeled on X-ray structures (•) decreases from 15% of all nsSNPs (15685 nsSNPs, 5% sequence identity) to 2.5% (3341) of all SNPs for which the structure of the wild type sequence has been determined experimentally (100% sequence identity). When only high quality structures are considered (◦), this amount is reduced by half to 7.4% for a sequence identity of 5% and 1.5% for exact models. C. Number of SNPs covered by structural data versus the sequence coverage of the wild type sequence. There are almost no SNPs for which the full length of the protein sequence is covered (100% coverage), but for 80% coverage almost 8000 SNPs can be selected, of which circa 5500 in high quality structures. D. Number of SNPs covered by structural data versus the length of the alignment between protein sequence and structural model. About a third of the SNPs that can be modeled are located in a structural alignment that is less than 100 amino acids long, both for models based on all X-ray structures (•) and based on high resolution structures only (◦). 5 Overall energy contribution Backbone H bond 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Side chain H bond Electrostatics 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Entropy side chain Entropy main chain 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Figure S2. ROC curves for classification of disease mutations and neutral variation by using structural properties of the amino acid substitution site. 6 Van der Waals contribution Solvation hydrophobic 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Main chain burial Side chain burial 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Solvation polar Van der Waals clash 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Figure S2 (continued). ROC curves for classification of disease mutations and neutral variation by using structural properties of the amino acid substitution site. 7 Entropy side chain Entropy side chain by sampling strategy by sampling strategy (surface) 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Entropy side chain by sampling strategy (buried) 100 80 60 TPR 40 20 0 0 20 40 60 80 100 FPR Figure S2 (continued). ROC curves for classification of disease mutations and neutral variation by using structural properties of the amino acid substitution site. 8 Overall stability difference Overall stability difference (surface residues) 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Overall stability difference (buried residues) Backbone clash 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Backbone H bond Sidechain H bond 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Figure S3. ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. 9 Electrostatics Entropy main chain 100 100 80 80 Electrostatics Entropy main chain 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Entropy side chain Solvation hydrophobic 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Solvation polar Torsion 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. 10 Van der Waals contribution Van der Waals clash 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Entropy side chain Entropy side chain by sampling strategy by sampling strategy (buried) 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Entropy side chain by sampling strategy (surface) Tango 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. 11 Tango (positive scores) Tango (negative scores) 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Waltz Waltz (positive scores) 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Waltz (negative scores) Limbo 100 100 80 80 60 60 TPR TPR 40 40 20 20 0 0 0 20 40 60 80 100 0 20 40 60 80 100 FPR FPR Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. 12 References 16. Worth CL, Bickerton GRJ, Schreyer A, Forman JR, 1. Chasman D, Adams RM: Predicting the func- Cheng TMK, Lee S, Gong S, Burke DF, Blundell TL: tional consequences of non-synonymous single nu- A structural bioinformatics approach to the anal- cleotide polymorphisms: Structure-based assess- ysis of nonsynonymous single nucleotide polymor- ment of amino acid variation. J Mol Biol 2001, phisms (nsSNPs) and their relation to disease. J 307(2):683–706. Bioinform Comput Biol 2007, 5(6):1297–1318. 2. Clifford RJ, Edmonson MN, Nguyen C, Buetow KH: 17. Burke DF, Worth CL, Priego EM, Cheng T, Smink LJ, Large-scale analysis of non-synonymous coding Todd JA, Blundell TL: Genome bioinformatic anal- region single nucleotide polymorphisms. Bioinfor- ysis of nonsynonymous SNPs. BMC Bioinformatics matics 2004, 20(7):1006–1014. 2007, 8:301. 3. Ferrer-Costa C, Orozco M, de la Cruz X: Sequence- 18. Worth CL, Burke DF, Blundell TL: Estimating the ef- based prediction of pathological mutations. Pro- fects of single nucleotide polymorphisms on pro- teins 2004, 57(4):811–819. tein structure: how good are we at identifying 4. Jiang R, Yang H, Sun F, Chen T: Searching for inter- likely disease associated mutations? In Proceedings pretable rules for disease mutations: a simulated of Molecular Interactions - Bringing Chemistry to Life annealing bump hunting strategy. BMC Bioinfor- 2006. matics 2006, 7:417. 19. Ng PC, Henikoff S: Accounting for human poly- 5. Krishnan VG, Westhead DR: A comparative study of morphisms predicted to affect protein function. machine-learning methods to predict the effects of Genome Res 2002, 12(3):436–446. single nucleotide polymorphisms on protein func- 20. Wang Z, Moult J: SNPs, protein structure, and dis- tion. Bioinformatics 2003, 19(17):2199–2209. ease. Hum Mutat 2001, 17(4):263–270. 6. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, West- 21. Halushka MK, Fan JB, Bentley K, Hsie L, Shen N, head DR: Predicting the effect of missense muta- Weder A, Cooper R, Lipshutz R, Chakravarti A: Pat- tions on protein function: analysis with Bayesian terns of single-nucleotide polymorphisms in can- networks. BMC Bioinformatics 2006, 7:405. didate genes for blood-pressure homeostasis. Nat 7. Ng PC, Henikoff S: Predicting deleterious amino Genet 1999, 22(3):239–247. acid substitutions. Genome Res 2001, 11(5):863–874. 22. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, 8. Saunders CT, Baker D: Evaluation of structural and Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N, evolutionary contributions to deleterious muta- Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington tion prediction. J Mol Biol 2002, 322(4):891–901. J, Lipshutz R, Daley GQ, Lander ES: Characterization of single-nucleotide polymorphisms in coding re- 9. Yue P, Li Z, Moult J: Loss of protein structure sta- gions of human genes (vol 22, pg 231, 1999). Nat bility as a major causative factor in monogenic Genet 1999, 23(3):373–373. disease. J Mol Biol 2005, 353(2):459–473. 23. Serrano L, Guerois R: Fold-X: An algorithm to pre- 10. Ferrer-Costa C, Orozco M, de la Cruz X: Characteriza- dict and engineer folding pathways. Abstr Pap Am tion of disease-associated single amino acid poly- Chem Soc 2001, 221:U395–U395. morphisms in terms of sequence and structure properties. J Mol Biol 2002, 315(4):771–786. 24. Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: 11. Bao L, Cui Y: Prediction of the phenotypic ef- A study of more than 1000 mutations. J Mol Biol fects of non-synonymous single nucleotide poly- 2002, 320(2):369–387. morphisms using structural and evolutionary in- formation. Bioinformatics 2005, 21(10):2185–2190. 25. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L: The FoldX web server: an online force 12. Bao L, Cui Y: Functional impacts of non- field. Nucleic Acid Res 2005, 33:W382–W388. synonymous single nucleotide polymorphisms: Selective constraint and structural environments. 26. Schymkowitz JWH, Rousseau F, Martins IC, Ferkinghoff- FEBS Lett 2006, 580(5):1231–4. Borg J, Stricher F, Serrano L: Prediction of water and metal binding sites and their affinities by using 13. Capriotti E, Calabrese R, Casadio R: Predicting the in- the Fold-X force field. Proc Natl Acad Sci USA 2005, surgence of human genetic diseases associated to 102(29):10147–10152. single point protein mutations with support vec- tor machines and evolutionary information. Bioin- 27. Vriend G: What If - a molecular modeling and drug formatics 2006, 22(22):2729–2734. design program. J Mol Graph 1990, 8:52–. 14. Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper 28. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, U, Eswar N, Haussler D, Sali A: LS-SNP: large- Serrano L: Prediction of sequence-dependent and scale annotation of coding non-synonymous SNPs mutational effects on the aggregation of peptides based on multiple information sources. Bioinfor- and proteins. Nat Biotechnol 2004, 22(10):1302–1306. matics 2005, 21(12):2814–2820. 29. Maurer-Stroh S, Kuemmerer N, Lopez de la Paz M, Mar- 15. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif tins I, Reumers J, Serrano L, Rousseau F, Schymkowitz S, Liang J: Structural location of disease-associated J: Accurate prediction of sequence determinants single-nucleotide polymorphisms. J Mol Biol 2003, of amyloid formation using the Waltz algorithm. 327(5):1021–1030. Submitted 2007. 13 30. Thompson MJ, Sievers SA, Karanicolas J, Ivanova MI, a method that integrates homology modelling and Baker D, Eisenberg D: The 3D profile method for experimental data. Submitted 2007. identifying fibril-forming segments of proteins. 32. McCarty JS, Rudiger S, Schonfeld HJ, Schneider- Proc Natl Acad Sci U S A 2006, 103(11):4074–4078. Mergener J, Nakahigashi K, Yura T, Bukau B: Regula- tory region C of the E. coli heat shock transcrip- 31. Van Durme J, Maurer-Stroh S, Wilkinson H, Rousseau tion factor, sigma32, constitutes a DnaK binding F, Schymkowitz J: Accurate prediction of the se- site and is conserved among eubacteria. J Mol Biol quence determinants of DnaK-peptide binding via 1996, 256(5):829–37. 14