=Paper= {{Paper |id=Vol-429/paper-2 |storemode=property |title=Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations |pdfUrl=https://ceur-ws.org/Vol-429/paper2.pdf |volume=Vol-429 |dblpUrl=https://dblp.org/rec/conf/eccb/ReumersSR08 }} ==Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations== https://ceur-ws.org/Vol-429/paper2.pdf
Using structural bioinformatics to investigate the impact of
non synonymous SNPs and disease mutations: scope and
limitations
Joke Reumers1, Joost Schymkowitz1 and Fréderic Rousseau∗1

1 Switch Laboratory, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium



Email: Joke Reumers - joke.reumers@vub.ac.be; Joost Schymkowitz - joost.schymkowitz@vub.ac.be; Fréderic Rousseau∗ -
frederic.rousseau@vub.ac.be;

∗ Corresponding author




Abstract
Background: Linking structural effects of mutations to functional outcomes is a major issue in structural
bioinformatics, and many tools and studies have shown that specific structural properties such as stability and
residue burial can be used to distinguish neutral variations and disease associated mutations.
Results: We have investigated 39 structural properties on a set of SNPs and disease mutations from the UniProt
Knowledge Base that could be mapped on high quality crystal structures and show that none of these properties
can be used as a sole classification criterion to separate the two data sets. Furthermore, we have reviewed the
annotation process from mutation to result and identified the liabilities in each step.
Conclusions: Although excellent annotation results of various research groups underline the great potential of
using structural bioinformatics to investigate the mechanisms underlying disease, the interpretation of such
annotations cannot always be extrapolated to proteome wide variation studies. Difficulties for large-scale studies
can be found both on the technical level, i.e. the scarcity of data and the incompleteness of the structural tool
suites, and on the conceptual level, i.e. the correct interpretation of the results in a cellular context.




Background
The molecular phenotype of a coding non synonymous SNP or disease associated mutation describes the
functional and structural properties of a protein that are affected by a single amino acid
substitution [25]. In this study we want to address whether the concept of the in silico determined
molecular phenotype can be employed for large-scale classification of SNPs and disease mutations.
The attempt to classify a large set of mutations based on an incomplete molecular phenotype may seem
naive at first glance, had it not been suggested that individual properties such as protein
stability, the accessibility of the amino acid substitution site, and the location of variants in
surface pockets are predictive determinants of the phenotypic effect of a variation [1–4]. A
comparative study of protein stability predictors by Blundell and co-workers demonstrated that
although protein stability changes caused by mutation can be relatively accurately estimated in
silico, these


predictions by themselves do not yield accurate large-scale classification of benign and disruptive
mutations [5–7].
    Furthermore, computational analyses rely heavily on the quality of the data under scrutiny and
the computational methods used to evaluate these data. Before investigating 39 structural properties
of proteins and amino acid substitutions for their predictive power regarding SNP classification, we
have investigated what major liabilities are encountered when implementing a structural approach to
SNP annotation and classification. The results are compared with those achieved by the best
performers among the state-of-the-art tools.

Results and Discussion
In this study we have identified the common issues that are encountered when performing large-scale
analyses of structural properties of human coding variation. The first issue concerns the
availability of structural data for nsSNPs and disease mutations, while the second involves the
availability of computational tools to predict structural properties. The last issue concerns the
quality of classification: are the training and evaluation data sets used in the analyses sufficient
to extrapolate results for larger studies, and do the properties used have sufficient predictive
power to separate the two data sets?

Structural coverage of human genetic variation
Despite structural genomics projects, the gap between sequence and structural information is still
wide, and the coverage of variation data with structural data is estimated to be as low as 14% [4].
We have investigated the boundaries of structural coverage by varying the quality requirements on
the structural model (Supplementary Figure S1A), the sequence identity between query sequence and
modelled structure (Figure S1B), the percentage of the wild type sequence covered by the structural
model (Figure S1C), and the length of the alignment between query and target (Figure S1D). Without
applying any restrictions, about 12% of all nsSNPs present in the Ensembl Variation Database
(release 44) can be mapped on a structural model, in accordance with the estimate cited previously.
However, this percentage is valid only when no restrictions regarding sequence identity, sequence
coverage or structure quality are applied. Our standard restrictions on building high-confidence
structural models using the FoldX force field are X-ray structures with a resolution of 2.5 Å or
better and sequence identity higher than 80%. Applying these restrictions to the Ensembl data
results in a data set of 5416 nsSNPs (circa 4% of the data, Figure S1B).

Predictability of structural properties
The second issue for a large-scale structural bioinformatics approach is the structural properties
that are predictable with state of the art tools: how well can we describe the structural behaviour
of a protein and its mutants? Previous structural studies have identified protein stability,
aggregation and misfolding as determinants of correct functioning on the single protein level
[7,11,12]. Mutations affecting the functional sites of a protein, such as DNA, ligand and protein
interaction sites, are not considered within this scope, but the investigation of these sites will
most certainly be of great importance to assess the impact of amino acid substitutions.
    Tools have been developed that describe the structure and dynamics of a protein: stability,
aggregation, amyloidosis, and folding. We have used computational methods that are capable of
assessing the effects of a mutation on protein stability (FoldX), aggregation (Tango) and
amyloidosis (Waltz). Although algorithms exist that can predict folding of small single domain
proteins (e.g. Rosetta [13], FoldX [14], SimFold [15]), to date no computational method exists that
can predict folding events on large multi-domain proteins, or that is applicable in genome wide
studies.
    Although we have not investigated protein-protein interactions in this study, we have included
an analysis of the binding of proteins to molecular chaperones, as it is directly related to correct
folding of the protein. The high abundance of chaperones in the cell emphasises their crucial role
[16], but this is not reflected in the availability of computational tools for chaperone binding. We
have used the only available tool, the Hsp70 binding predictor Limbo [17], to assess chaperone
binding variation caused by amino acid alteration.

The predictive power of structural properties
Following the recommendations of Care et al [18], we have used the SwissProt annotated disease and


polymorphism data (SwissProt Variation Index release 52) as the evaluation data for our analyses.
Mapping of these variants on high quality structural models (X-ray structures with resolution
≤ 2.5 Å, sequence identity with the model above 80%) yielded a data set of 240 positive
(disease-associated) mutations and 400 negative variations (neutral nsSNPs) in 98 proteins. To
ensure that the analyses are comparable, we applied the sequence based predictors to the same small
data set as the predictors that use 3D structures or structural models.
    Before we evaluated the discriminative power of the individual structural parameters, we wanted
to assess whether our data showed distinguishable patterns for three important parameters. The first
two criteria, stability difference and the degree of burial of the mutation site, have previously
been identified as providing information about the severity of a mutation [4, 19]. The third
criterion is difference in aggregation propensity, which has been cited as likely to be an important
factor in disease susceptibility [12, 20] but thus far has not been applied in a proteome wide
mutation analysis.
    Figure 1 shows the distributions for the stability differences (A) and differences in
aggregation propensity (B) between wild type and variant proteins, and the burial of the mutation
site (C). The first observation of both the stability and the aggregation analysis is that the
observed changes are not discrete but follow a smooth distribution from negative to positive change.
Second, there are noticeable differences between SNPs and disease mutations, but they cannot be
distinguished by a simple cut-off value on the output, as there is large overlap between the
distributions. This is confirmed by the P-values obtained from paired Student's t-tests, which are
0.96 for the stability distributions, 0.99 for the aggregation distributions, and 0.99 for the
burial distributions, respectively. For the stability distributions, we see that disease mutations
are generally more destabilising than SNPs, but their distributions overlap largely. A similar
analysis has been performed on SwissProt variants using the Site Directed Mutator stability
predictor [7], and the distributions of stability differences of disease mutations and neutral
variations are similar to our findings.
    In a first series of properties to test as classifiers, we have investigated 15 properties of
the amino acid substitution site that contribute to the assessment of the effect of the mutation
using the FoldX algorithm (Table 3). Cut-off values were generated that varied between the minimal
and maximal values measured for the specific property, and the true and false positive rates and the
Matthews correlation coefficient (MCC) were calculated for each cut-off value. Table 3 lists the
data for both the best MCC and the MCC90, i.e. the coefficient that is measured at high specificity
(true negative rate = 90%). The corresponding ROC curves for these analyses can be found in
Supplementary Figure S1.
    The same strategy was then applied to predicted values of structural differences between mutant
and wild type proteins (24 properties). Statistics were calculated for stability and entropy
parameters, as well as for differences concerning protein aggregation, amyloidosis and chaperone
binding (Table 4, Supplementary Figure S2).
    The results obtained from these detailed analyses are unanimous: none of the parameters
evaluated can be used to separate the data. All MCC values are close to zero, and thus the
predictions are no better than those of a random predictor on the data. The high accuracy of FoldX
for stability estimation has been proven in various studies [6,9,10], so we have high confidence in
our stability estimations. In accordance with the analyses of Worth et al. [7], we find that high
stability differences alone are not a sufficient criterion to distinguish deleterious mutations and
neutral variation. These results show that the dominant effect of, for instance, stability that was
proposed in earlier large-scale studies [4, 22] cannot always be generalised to other data.
    The fact that none of the properties representing conformational differences between wild type
and variant protein contains enough information to distinguish neutral and deleterious variation
implies that large-scale classification based on singular structural properties is not feasible and
requires a better understanding of how the complex interplay between biophysical and biochemical
properties of a protein leads to different tolerance for mutations in different proteins.
    Recent studies that combine structural and evolutionary information using machine learning
techniques are able to classify relatively large data sets obtained from the SwissProt database
successfully (summarised in Table S2). Machine learning approaches suggest that data integration is
indeed the way forward, but the creation of this black box style of classifier does not offer
insight into the biological processes. In the same way that using evolutionary information to
classify SNPs obscures how and


why a specific mutation is deleterious, using black box machine learning methods will not teach us
what the underlying reason of disease is. Although knowing that an amino acid is critical for
correct function is of course useful, in a structural bioinformatics approach the focus is more on
the molecular mechanism underlying disease.
    A simple combination of the SNPeffect structural bioinformatics toolsuite on our evaluation data
set showed that in our case a linear combination of these methods, at least, is not sufficient to
classify the data (TPR = 0.73, TNR = 0.27, MCC = 0). A large part of the polymorphism data is
predicted to have a deleterious effect. To assess the “predictiveness” of our data set, we applied
the well-established evolutionary method SIFT [24] to our data and found that SIFT was also not able
to classify effectively. In fact the results were even worse than our naive classifier (TPR = 0.69,
TNR = 0.21, MCC = −0.12).
    As an illustration of the influence of the data set used for evaluation on the performance of a
predictor, we list the results for the variation in performance of SNP classification of SIFT, which
uses evolutionary information to label SNPs (Supplementary Table S3). The Matthews correlation
coefficient ranges from −0.12 on our data set, through 0.25 on human mutagenesis data, up to 0.59 on
the HIV-1 protease mutagenesis set in the original SIFT paper [24]. This is yet another informative
example of how crucial the choice of training and test data is to build and evaluate predictors:
generalisation of results is only possible when the training data are expressive enough to represent
the entire feature space.

Conclusions
The concept of using the molecular phenotypic effect of a nsSNP to assess its effect on the
structure and function of the protein it alters was first introduced by Bork and co-workers [25].
The question has been raised as to how much of this molecular phenotype is necessary to evaluate the
contribution of a SNP to a disease phenotype: are there singular dominant properties that determine
the impairment of structure and function, or do we need to consider the full ensemble of molecular
properties to interpret the impact of the SNP? Other research groups have proposed that single
properties such as stability [4] and solvent accessibility [1] can be used to classify SNPs. We have
examined all the individual structural bioinformatics tools that were proposed in the SNPeffect
toolsuite [26] for their ability to act as a binary classifier for deleterious and neutral SNPs.
None of the individual properties that were examined could serve this purpose. Because several
approaches were able to classify data sets similar to the one we have used, we applied the most used
evolutionary method, SIFT [23], to our data set. As it was not able to classify our data set
accurately, we argued that generalisation of the results presented by the state of the art
classifiers might be an important issue. We illustrated this problem with the variability of
performance of SIFT on 8 different data sets used in various analyses.
    From these analyses we concluded that strict classification of SNPs is not feasible at this
time, both because there are still many technical difficulties to overcome, and because the
biological interpretation of the molecular phenotype in relation to a disease phenotype is a complex
matter. Even at the single molecule level, we cannot assess how tolerant a specific protein is to
structural variation. The inherent rigidity of a protein might influence the change in stability
that is allowed before severe conformational changes are introduced. Furthermore, on the cellular
level biological interpretation is even harder: we cannot predict the role the protein quality
control system plays in this tolerance level, and not all interactions are described at the
molecular level, among other limitations. Even if we can predict the molecular effect accurately,
this might not necessarily result in a disease phenotype because of functional redundancy of the
protein.
    However, not being able to classify human variation into disease mutations and neutral or
beneficial variation does not mean that this approach or the methods developed are useless. By using
high quality bioinformatics tools, we can select from a large pool of variations the candidates that
are interesting for detailed investigation. This in itself is a valuable contribution, because the
amount of variation data available is too massive to be investigated experimentally. In silico
analyses can and will be used successfully as an addition to in vitro and in vivo studies.
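
The cut-off scan used in the Results (true and false positive rates and MCC for each cut-off between the minimal and maximal value of a property, summarised as the best MCC and the MCC90) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the evenly spaced cut-offs, the rule "score >= cut-off means deleterious", and the choice of the cut-off whose true negative rate is closest to 90% for the MCC90 are all assumptions made for the example.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def scan_cutoffs(scores, labels, n_steps=100):
    """Evaluate TPR, FPR, TNR and MCC at evenly spaced cut-offs.

    Assumption: a variant is called deleterious when score >= cutoff.
    labels: 1 = disease mutation (positive), 0 = neutral polymorphism.
    """
    lo, hi = min(scores), max(scores)
    pos = sum(labels)
    neg = len(labels) - pos
    rows = []
    for i in range(n_steps + 1):
        cutoff = lo + (hi - lo) * i / n_steps
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
        fn, tn = pos - tp, neg - fp
        rows.append({
            "cutoff": cutoff,
            "tpr": tp / pos if pos else 0.0,
            "fpr": fp / neg if neg else 0.0,
            "tnr": tn / neg if neg else 0.0,
            "mcc": mcc(tp, fp, tn, fn),
        })
    return rows


def summarise(rows, target_tnr=0.90):
    """Return (best MCC, MCC at the cut-off whose TNR is closest to target)."""
    best = max(rows, key=lambda r: r["mcc"])
    at_target = min(rows, key=lambda r: abs(r["tnr"] - target_tnr))
    return best["mcc"], at_target["mcc"]


def mcc_from_rates(tpr, tnr, n_pos, n_neg):
    """MCC reconstructed from reported rates and class sizes."""
    tp = tpr * n_pos
    tn = tnr * n_neg
    return mcc(tp, n_neg - tn, tn, n_pos - tp)
```

Because the numerator of the MCC is proportional to TPR + TNR − 1, `mcc_from_rates(0.73, 0.27, 240, 400)` evaluates to 0 for the linear SNPeffect combination, matching the value reported above. The rounded SIFT rates (0.69, 0.21) give roughly −0.11, close to the reported −0.12.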


Methods
Assembly of data sets
Statistics on the structural coverage and validation status of human non synonymous coding SNPs were
computed on data from the Ensembl human variation database release 44, containing 12.2 million SNPs,
of which 133698 cause an amino acid variation in a known transcript. The mapping of SNPs on protein
structures was evaluated using the “ensp-pdbmapping” DAS service provided by the SPICE server [27].
Positive and negative data sets for the evaluation of SNP classification were designed with data
from the SwissProt variation index [28] in the UniProt knowledge base (version 52.0, March 2007,
[29]) that were mapped onto known PDB structures and high quality homologs thereof. The quality
criteria described in the results section (models with resolution of 3 Å or higher, sequence
identity of 80% or more) led to structural models of 400 SNPs (negative) and 240 disease associated
mutations (positive).

Structural bioinformatics tools
We have used the FoldX force field [33] for all mutant properties regarding structural location,
protein stability and its various components, the Tango [34] and Waltz [35, submitted] algorithms to
assess the propensity for aggregation of wild type and variant proteins, and the Limbo algorithm
[17, submitted] to evaluate the chaperone-binding properties of amino acid sequences. A novel tool
developed by Lenaerts et al (unpublished) was used to estimate the entropy of a specific amino acid
site in a high-resolution structure. Detailed descriptions of these five tools can be found in the
Supplementary Material.

Authors' contributions
Conceived and designed the experiments: JR JS FR. Performed the experiments: JR. Analysed the data:
JR JS FR. Wrote the paper: JR.

Acknowledgements
Joke Reumers was supported by a grant from the Federal Research Office (FWO, IUAP P6/43), Belgium,
and the Institute for the encouragement of Scientific Research and Innovation of Brussels (ISRIB),
Belgium.

References
 1. Chasman D, Adams RM: Predicting the functional consequences of non-synonymous single
    nucleotide polymorphisms: Structure-based assessment of amino acid variation. J Mol Biol 2001,
    307(2):683–706.
 2. Ferrer-Costa C, Orozco M, de la Cruz X: Characterization of disease-associated single amino
    acid polymorphisms in terms of sequence and structure properties. J Mol Biol 2002,
    315(4):771–786.
 3. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif S, Liang J: Structural location of
    disease-associated single-nucleotide polymorphisms. J Mol Biol 2003, 327(5):1021–1030.
 4. Yue P, Li Z, Moult J: Loss of protein structure stability as a major causative factor in
    monogenic disease. J Mol Biol 2005, 353(2):459–473.
 5. Worth CL, Burke DF, Blundell TL: Estimating the effects of single nucleotide polymorphisms on
    protein structure: how good are we at identifying likely disease associated mutations? In
    Proceedings of Molecular Interactions - Bringing Chemistry to Life 2006.
 6. Burke DF, Worth CL, Priego EM, Cheng T, Smink LJ, Todd JA, Blundell TL: Genome bioinformatic
    analysis of nonsynonymous SNPs. BMC Bioinformatics 2007, 8:301.
 7. Worth CL, Bickerton GRJ, Schreyer A, Forman JR, Cheng TMK, Lee S, Gong S, Burke DF, Blundell
    TL: A structural bioinformatics approach to the analysis of nonsynonymous single nucleotide
    polymorphisms (nsSNPs) and their relation to disease. J Bioinform Comput Biol 2007,
    5(6):1297–1318.
 8. Boutselakis H, Dimitropoulos D, Fillon J, Golovin A, Henrick K, Hussain A, Ionides J, John M,
    Keller PA, Krissinel E, McNeil P, Naim A, Newman R, Oldfield T, Pineda J, Rachedi A, Copeland J,
    Sitnov A, Sobhany S, Suarez-Uruena A, Swaminathan J, Tagari M, Tate J, Tromm S, Velankar S,
    Vranken W: E-MSD: the European Bioinformatics Institute Macromolecular Structure Database.
    Nucleic Acids Res 2003, 31:458–462.
 9. Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein
    complexes: A study of more than 1000 mutations. J Mol Biol 2002, 320(2):369–387.
10. Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS: The stability effects of protein
    mutations appear to be universally distributed. J Mol Biol 2007, 369(5):1318–1332.
11. Steward RE, MacArthur MW, Laskowski RA, Thornton JM: Molecular basis of inherited diseases: a
    structural perspective. Trends Genet 2003, 19(9):505–513.
12. DePristo M, Weinreich D, Hartl D: Missense meanderings in sequence space: A biophysical view
    of protein evolution. Nature Reviews Genetics 2005, AOP.
13. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III
    targets using ROSETTA. Proteins 1999, Suppl 3:171–176.


14. Serrano L, Guerois R: Fold-X: An algorithm to predict and engineer folding pathways. Abstr Pap
    Am Chem Soc 2001, 221:U395–U395.
15. Fujitsuka Y, Chikenji G, Takada S: SimFold energy function for de novo protein structure
    prediction: consensus with Rosetta. Proteins 2006, 62(2):381–398.
16. Soti C, Csermely P: Protein stress and stress proteins: implications in aging and disease. J
    Biosci 2007, 32(3):511–515.
17. Van Durme J, Maurer-Stroh S, Wilkinson H, Rousseau F, Schymkowitz J: Accurate prediction of the
    sequence determinants of DnaK-peptide binding via a method that integrates homology modelling
    and experimental data. Submitted 2007.
18. Care MA, Needham CJ, Bulpitt AJ, Westhead DR: Deleterious SNP prediction: be mindful of your
    training data! Bioinformatics 2007, 23(6):664–672.
19. Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res
    2002, 30(17):3894–3900.
20. Worth CL, Blundell TL: Estimating the effects of SNPs on protein structure: loss of protein
    interactions and stability as indicators of mis-function and disease-association. Curr Top
    Biochem Res 2008, In press.
21. Stitziel NO, Binkowski TA, Tseng YY, Kasif S, Liang J: TopoSNP: a topographic database of
    non-synonymous single nucleotide polymorphisms with and without known disease association.
    Nucleic Acids Res 2004, 32:D520–D522.
22. Yue P, Melamud E, Moult J: SNPs3D: candidate gene and SNP selection for association studies.
    BMC Bioinformatics 2006, 7:166.
23. Ng PC, Henikoff S: SIFT: predicting amino acid changes that affect protein function. Nucleic
    Acids Res 2003, 31(13):3812–3814.
24. Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res 2001,
    11(5):863–874.
25. Sunyaev S, Lathe Wr, Bork P: Integration of genome data and protein structures: prediction of
    protein folds, protein interactions and “molecular phenotypes” of single nucleotide
    polymorphisms. Curr Opin Struct Biol 2001, 11:125–130.
26. Reumers J, Conde L, Medina I, Maurer-Stroh S, Van Durme J, Dopazo J, Rousseau F, Schymkowitz J:
    Joint annotation of coding and non-coding single nucleotide polymorphisms and mutations in the
    SNPeffect and PupaSuite databases. Nucleic Acids Res 2008, 36(Database issue):D825–9.
27. Prlic A, Down TA, Hubbard TJ: Adding some SPICE to DAS. Bioinformatics 2005, 21 Suppl 2:ii40–1.
28. Yip YL, Famiglietti M, Gos A, Duek PD, David FPA, Gateau A, Bairoch A: Annotating single amino
    acid polymorphisms in the UniProt/Swiss-Prot knowledgebase. Hum Mutat 2008, 29(3):361–366.
29. UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2007,
    35(Database issue):D193–7.
30. Zweig MH, Campbell G: Receiver-operating characteristic (ROC) plots: a fundamental evaluation
    tool in clinical medicine. Clin Chem 1993, 39(4):561–577.
31. Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme.
    Biochim Biophys Acta 1975, 405(2):442–451.
32. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction
    algorithms for classification: an overview. Bioinformatics 2000, 16(5):412–424.
33. Schymkowitz JWH, Rousseau F, Martins IC, Ferkinghoff-Borg J, Stricher F, Serrano L: Prediction
    of water and metal binding sites and their affinities by using the Fold-X force field. Proc
    Natl Acad Sci USA 2005, 102(29):10147–10152.
34. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L: Prediction of sequence-dependent
    and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol 2004,
    22(10):1302–1306.
35. Maurer-Stroh S, Kuemmerer N, Lopez de la Paz M, Martins I, Reumers J, Serrano L, Rousseau F,
    Schymkowitz J: Accurate prediction of sequence determinants of amyloid formation using the
    Waltz algorithm. Submitted 2007.



Figures
Figure 1 - Distributions for the major structural criteria in the disease and polymorphism datasets.
White = disease mutations, grey = polymorphisms. A. Stability difference as calculated by the FoldX force
field (in kcal.mol−1). B. Difference in aggregation propensity as calculated by the Tango algorithm. Values
close to neutral changes (in the range [−50, 50]) are left out for display purposes. C. Distribution of degree
of burial of the amino acid substitution site.




Tables
Table 1 - Summary of structural coverage of SNP data.
Several criteria resulting from the above analyses are applied to assess the structural coverage and reliability
of that coverage of human SNPs in the Ensembl database, as well as the overlap of the structural coverage
with quality parameters for the validation and frequency status of the polymorphism data.


     Properties                                                                      # SNPs    % SNPs
     nsSNPs covered by high quality structural data
     No additional criteria                                                            9877      7.4
     Sequence coverage > 80 or alignment length > 100                                  8238      6.2
     Sequence identity > 80                                                            5416      4.1
     Sequence coverage > 80 or alignment length > 100, and sequence identity > 80      5318      4.0
     Highly reliable nsSNPs covered by high quality structural data
     Doublehit validation status, MAF > 0.01                                            680      0.51
     Doublehit validation status, MAF > 0.01, sequence identity > 80                    229      0.17
     Doublehit validation status, MAF > 0.01, sequence coverage > 80 or
       alignment length > 100                                                           446      0.33
     Doublehit validation status, MAF > 0.01, sequence coverage > 80 or
       alignment length > 100, and sequence identity > 80                               209      0.16
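
The selection criteria in Table 1 can be expressed as simple predicates on a mapped SNP. The following is a minimal sketch, not the pipeline used for the table: the record type `MappedSnp` and all of its field names are hypothetical, and it assumes that sequence identity and sequence coverage are expressed in percent.

```python
from dataclasses import dataclass


@dataclass
class MappedSnp:
    """One nsSNP mapped onto a structure (all field names are hypothetical)."""
    seq_identity: float    # % identity between query sequence and structural model
    seq_coverage: float    # % of the query sequence covered by the alignment
    alignment_length: int  # number of aligned residues
    doublehit: bool        # True if the validation status is "doublehit"
    maf: float             # minor allele frequency


def passes_alignment(snp):
    # "Sequence coverage > 80 or alignment length > 100"
    return snp.seq_coverage > 80 or snp.alignment_length > 100


def passes_identity(snp):
    # "Sequence identity > 80"
    return snp.seq_identity > 80


def is_reliable(snp):
    # "Doublehit validation status, MAF > 0.01"
    return snp.doublehit and snp.maf > 0.01


def passes_all(snp):
    # Strictest row of the table: all criteria combined.
    return passes_alignment(snp) and passes_identity(snp) and is_reliable(snp)


# Toy records, invented for illustration only.
snps = [
    MappedSnp(95.0, 60.0, 150, True, 0.05),   # long alignment rescues low coverage
    MappedSnp(95.0, 60.0, 80, True, 0.05),    # fails the alignment criterion
    MappedSnp(40.0, 90.0, 300, True, 0.20),   # fails the identity criterion
    MappedSnp(99.0, 85.0, 120, False, 0.30),  # fails the validation criterion
]
selected = [snp for snp in snps if passes_all(snp)]
print(len(selected), "of", len(snps), "toy SNPs pass all criteria")
```

Because each criterion is a separate predicate, the intermediate rows of the table correspond to applying only a subset of them.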


Table 2 - Predictive power of structural properties of the modeled variant proteins.
FoldX was used to evaluate both the overall stability contribution of the amino acid substitution site in
the modeled structure and the various factors involved in this stability. The entropy of the variant amino
acid was calculated using a sampling strategy to assess the possible side chain conformations allowed at the
substitution site. Both stability and entropy were calculated for all mutations and for a subset of buried
mutations (side chain burial < 0.5) and surface mutations (side chain burial ≥ 0.5). Corresponding ROC
curves are shown in Supplementary Figure S2.


     Property                        FPR (%)   TPR (%)   Best MCC   Threshold   MCC90
     FoldX energy evaluation
     Overall stability of residue      14        33        0.22        1.61      0.19
     Backbone H bond                   32        72        0.40       -1.05      0.22
     Sidechain H bond                  99       100        0.07       -1.76      <0
     Electrostatics                    86        93        0.11       -0.10     -0.01
     Entropy side chain                59        80        0.22        0.32      0.05
     Entropy main chain                13        27        0.18        1.96      0.10
     Van der Waals contribution        25        47        0.23       -0.98      0.15
     Solvation hydrophobic             10        22        0.16       -0.6       0.16
     Solvation polar                   42        70        0.28        1.5       0.06
     Van der Waals clash               18        33        0.17        0.22      0.15
     Side chain burial                 51        67        0.16        0.43     -0.1
     Main chain burial                 59        83        0.26        0.73      0.05
     Entropy by sampling of possible side chain conformations
     Entropy side chain                72        84        0.15        0.93      0

Table 3 - Predictive power of the differences between wild type and variant proteins for different
structural properties.
FoldX was used to evaluate both the overall stability difference between wild type and variant structure, and
the constituting contributions leading to this stability difference. The entropy difference caused by the amino
acid substitution was calculated using a sampling strategy to assess the possible side chain conformations
allowed at the substitution site. Both stability and entropy difference were calculated for all mutations and
for a subset of buried mutations (side chain burial < 0.5) and surface mutations (side chain burial ≥ 0.5).
Corresponding ROC curves are shown in Supplementary Figure S3.


     Property                            FPR (%)   TPR (%)   Best MCC   Threshold   MCC90
     FoldX energy evaluation
     Overall stability difference          73        85        0.15       -0.45      0.14
     Overall stability diff. (surface)      0         8        0.2         3.1       0.13
     Overall stability diff. (buried)      21        44        0.25        2.64      0.12
     Backbone clash                        91        99        0.18       -1.00     -0.02
     Backbone H bond                       59        83        0.26       -0.025     0.06
     Sidechain H bond                      79        92        0.18       -0.13     -0.14
     Electrostatics                         6        18        0.18        0.15      0.16
     Entropy main chain                     6        18        0.18        0.15      0.04
     Entropy side chain                    64        74        0.11       -0.125    -0.05
     Solvation hydrophobic                 57        75        0.19       -0.15     -0.03
     Solvation polar                       22        36        0.15        0.20     -0.05
     Torsion clash                          1         3        0.07        1.00     -0.05
     Van der Waals contribution             7        14        0.11        0.89      0.10
     Van der Waals clash                   98       100        0.10       -1.60      0.02
     Entropy difference by sampling of possible side chain conformations
     FoldX entropy difference              85        92        0.11       -1.85     -0.02
     FoldX entropy diff. (buried)          96       100        0.14       -2.70     -0.05
     FoldX entropy diff. (surface)         37        57        0.20       -0.10      0.02
     Aggregation properties
     Tango                                  1         3        0.07       39.9       0
     Tango (positive, more aggr.)          14        22        0.10       16.37      0
     Tango (negative, less aggr.)          69        78        0.10       -8.00      0
     Waltz                                  0         1        0.07      748.97      0
     Waltz (positive, more aggr.)          16        21        0.06      677.15      0
     Waltz (negative, less aggr.)          99       100        0.07    -2412.78      0
     Limbo                                 17        33        0.18        5.45      0
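
The "Best MCC" and "Threshold" columns report, for each property, the cut-off that maximises the Matthews correlation coefficient. A minimal sketch of such a threshold scan is given below. It is an illustration and not the authors' code: the function names are invented, the scores and labels are toy values, and it assumes that a score at or above the cut-off is called a disease mutation (for some properties the direction may be reversed).

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(scores, is_disease):
    """Try every observed score as cut-off and return (best MCC, cut-off)."""
    best_mcc, best_cut = -2.0, None
    for cut in sorted(set(scores)):
        tp = fp = tn = fn = 0
        for score, disease in zip(scores, is_disease):
            predicted = score >= cut  # assumption: high score means disease
            if predicted and disease:
                tp += 1
            elif predicted and not disease:
                fp += 1
            elif not predicted and disease:
                fn += 1
            else:
                tn += 1
        value = mcc(tp, fp, tn, fn)
        if value > best_mcc:  # ties keep the lowest cut-off
            best_mcc, best_cut = value, cut
    return best_mcc, best_cut


# Toy data: a structural property value per variant and its class label.
scores = [-0.5, 0.1, 0.4, 1.2, 2.0, 3.1]
labels = [False, False, True, False, True, True]
best_mcc, best_cut = best_threshold(scores, labels)
print(f"best MCC {best_mcc:.2f} at threshold {best_cut}")
```

The FPR and TPR reported next to each threshold follow from the same confusion-matrix counts, as FP / (FP + TN) and TP / (TP + FN).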


Additional Files
Figure 1 – figure1.pdf
Additional file 2 — supplementary.pdf
Several of the less critical figures and tables are added as supplementary material, together with detailed
descriptions of the structural bioinformatics tools used.




Supplementary Material


Methods

Structural bioinformatics tools

FoldX
The FoldX force field was developed for the fast and accurate estimation of the change in stability
(free energy) of a protein or a protein complex upon mutation [23–26]. It uses an all-atom
representation of these macromolecules, and has been validated on a test database of more than 1000
mutants from more than 20 different proteins. It currently yields a correlation of 0.78 with a
standard deviation of 0.41 kcal/mol.

Modelling and evaluation of mutations in FoldX is performed with the BuildModel command. It is used
first to model a homologous sequence on a structural model and to optimise the side chains to fit
the new sequence, and then to evaluate the effect of a single amino acid variation. The Gibbs free
energy of a protein is calculated with the Stability command. The structural parameters used in the
classification tests include backbone clash, backbone H bond formation, sidechain H bond formation,
electrostatics, solvation of hydrophobic residues, solvation of polar residues, torsion clash,
Van der Waals contribution and Van der Waals clash.

Entropy calculations based on side chain sampling
In addition to the entropy calculations intrinsic to the FoldX force field, we use a novel method
based on extensive sampling of side chain conformations as developed by Lenaerts et al.
(unpublished). The sampling method produces for each side chain the probability $P(X)$ of finding
the residue's side chain in a particular conformational state. From these probabilities the entropy
can be derived:

$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

The method uses a rotamer database based on conditional statistics of dihedral angles derived from
the WHAT IF data set [27]. All amino acids from this data set and their corresponding dihedral
angles (10° bins) were used to derive the following probabilities: $P(\chi_i)$,
$P(\chi_i \mid \chi_{i-1})$ and $P(\chi_i \mid \chi_{i-1}, \chi_{i-2})$, except for $\chi_1$
($P(\chi_1)$ and $P(\chi_1 \mid \phi, \psi)$). A set of n random rotamers can be derived from the
probability distribution thus calculated. This allows sampling of rotamers with greater resolution
than classical rotamer libraries.

The sampling itself is performed by a Monte Carlo based sampling method with the Metropolis
criterion (at 298 K). The Metropolis criterion states that a conformational change is accepted with
a probability p that depends on the free energy change ∆∆G associated with the conformational
change, as given by the following formula:

$$p = \begin{cases} 1 & \text{if } \Delta\Delta G < 0 \\ e^{-\Delta\Delta G / RT} & \text{if } \Delta\Delta G \geq 0 \end{cases}$$

The free energy of each change is determined with FoldX.

Tango
The β-aggregation prediction algorithm Tango [28] uses a statistical mechanics approach to represent
a competition between major conformational states: the random coil and the native conformations, as
well as β-turn, α-helix and β-aggregate. Two windows of variable length slide over the sequence, and
each such window can populate these conformational states according to a Boltzmann distribution. The
frequency of population of each structural state for a given segment will be relative to its energy,
which is derived from statistical and empirical parameters. To predict the β-aggregating segments of
a peptide, Tango calculates the partition function of the phase space involving these conformational
states. In our analysis we have used Tango to calculate the difference in aggregation tendency that
results from a single amino acid variation.

Waltz
Current methods for the prediction of the sequence determinants of amyloidosis suffer from two major
problems: overpredicting amorphous cross β aggregates and missing amyloidogenic sequences that are
enriched in the polar Q and N residues, such as the prion protein. The Waltz algorithm [29,
submitted] tackles these problems by taking into account amyloid hexapeptides from 48 new amyloid
forming sequences, derived from 31 proteins. About half of the proteins in this extended data set,
such as presenilin-2, titin and myosin, were not previously known to contain amyloidogenic
sequences. Waltz combines terms from amino acid sequence scoring in the learning set, physical
property analysis and homology modelling. The method shows 84% sensitivity at 92% specificity on the
AmylHex data set [30], and correctly identifies mutations in human proteins known to be associated
with amyloid deposition.

Limbo
Limbo is a Hsp70 binding site predictor that was built using a dual method combining sequence and
structural information [31, submitted]. Experimental DnaK binding data of 53 non-redundant peptide
sequences was used to generate a sequence-based position-specific scoring matrix (PSSM) based on
logarithm of the odds scores. Following an in silico alanine scan of the substrate peptide in the
crystal structure of a DnaK-substrate complex (PDB ID 1DKX; Zhu et al. 1996) using FoldX, a
structure-based PSSM that reflects the individual contribution of certain substrate residue types
for DnaK binding was generated. The Limbo DnaK binding site predictor was obtained by combining the
structure-based PSSM with a normalisation factor of 0.2 with the sequence-based PSSM. Limbo is able
to correctly predict 89% of the true positives in a tested peptide set (high sensitivity), with a
concurrent amount of only 5.9% false positives for a specific score threshold (high specificity).
The robustness of the predictor was evaluated with a cross-validation test, resulting in a true
positive rate of 72% and a false positive rate of 5.9%. The predictor was able to identify an entire
known DnaK binding site in the heat-shock promoter σ32 [32]. We have used Limbo to rank mutated
proteins according to their DnaK binding affinity.
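
The combination of a sequence-based and a structure-based PSSM described for Limbo can be illustrated with a small sketch. This is not the Limbo implementation: the exact combination rule is not given in the text, so the sketch assumes that the factor 0.2 is a multiplicative weight on the structure-based matrix before it is added to the sequence-based log-odds matrix. The function names, the two-letter toy alphabet, the pseudocount and the toy structure-based values (taken here as "higher means better binding") are all invented for illustration.

```python
import math


def log_odds_pssm(peptides, background, pseudocount=1.0):
    """Position-specific log-odds matrix from equal-length binder peptides."""
    alphabet = sorted(background)
    pssm = []
    for pos in range(len(peptides[0])):
        counts = {aa: pseudocount for aa in alphabet}
        for peptide in peptides:
            counts[peptide[pos]] += 1
        total = sum(counts.values())
        # log of (observed frequency / background frequency) per residue type
        pssm.append({aa: math.log(counts[aa] / total / background[aa])
                     for aa in alphabet})
    return pssm


def combine(seq_pssm, struct_pssm, weight=0.2):
    """Assumed rule: sequence score plus weight times structure score."""
    return [{aa: s[aa] + weight * t[aa] for aa in s}
            for s, t in zip(seq_pssm, struct_pssm)]


def score(pssm, peptide):
    """Sum the per-position scores of a peptide."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# Toy example with a two-letter alphabet and two positions.
background = {"K": 0.5, "L": 0.5}
seq_pssm = log_odds_pssm(["LL", "LK", "LL"], background)
struct_pssm = [{"K": -0.5, "L": 1.0}, {"K": -0.2, "L": 0.5}]  # hypothetical values
combined = combine(seq_pssm, struct_pssm)
print(round(score(combined, "LL"), 3), round(score(combined, "KK"), 3))
```

Ranking mutated proteins then amounts to scoring the wild type and variant sequence windows with the combined matrix and comparing the results.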




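
The side chain sampling entropy described in the Methods combines two ingredients: Metropolis acceptance of conformational changes at 298 K and the Shannon entropy of the resulting state probabilities. The sketch below shows both on a toy system. It is not the method of Lenaerts et al.: the discrete states, their energies and the uniform proposal scheme are assumptions made for illustration, whereas in the paper the free energy of each change is determined with FoldX and rotamers are drawn from the rotamer database.

```python
import math
import random

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def metropolis_accept_probability(ddg, temperature=298.0):
    """p = 1 if ddG < 0, otherwise exp(-ddG / RT); ddG in kcal/mol."""
    if ddg < 0:
        return 1.0
    return math.exp(-ddg / (R_KCAL * temperature))


def shannon_entropy(probabilities):
    """H(X) = -sum_i P(x_i) log2 P(x_i), in bits; zero terms are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def sample_states(energies, steps, rng):
    """Metropolis walk over discrete states; returns visit frequencies."""
    state = 0
    visits = [0] * len(energies)
    for _ in range(steps):
        proposal = rng.randrange(len(energies))  # symmetric, uniform proposal
        ddg = energies[proposal] - energies[state]
        if rng.random() < metropolis_accept_probability(ddg):
            state = proposal
        visits[state] += 1
    return [v / steps for v in visits]


rng = random.Random(0)
# Toy side chain with two equally favourable states and one costly state.
frequencies = sample_states([0.0, 0.0, 3.0], steps=20000, rng=rng)
print("entropy (bits):", round(shannon_entropy(frequencies), 3))
```

A side chain restricted to a single state has zero entropy, while n equally populated states give log2(n) bits, so the toy example above yields a value slightly above 1 bit.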
Tables




Supplementary Table S1 - Types of data sets used to train and test SNP classifiers.


     Origin data set         Size of data set   Number of studies   References
     Neutral variations
     Mutagenesis studies     111-3706                   9           [1–9]
     Orthologs               888-16682                  3           [3, 9, 10]
     SwissProt SNP           502-12944                  6           [3, 8, 11–14]
     OMIM                    558                        1           [15]
     dbSNP                   5177-21471                 2           [16, 17]
     Disease mutations
     Mutagenesis studies     159-1750                   8           [1–9]
     COSMIC database         879                        1           [18]
     HGMD                    3768-10263                 1           [9]
     OMIM                    879-2249                   5           [3, 8, 13, 15, 18]
     SwissProt Disease       175-9610                   9           [3, 8, 10–14, 19, 20]
     Data [21]               209                        1           [20]
     Data [22]               185                        2           [19, 20]


Supplementary Table S2 - Performance of state-of-the-art predictors on representative data sets.
The performance of a few selected tools on SwissProt disease associated mutations and SNP data are shown.


     Study                  Method           FPR     FNR     TPR     TNR     MCC    Set size
     Bao et al [11]         Random Forest    0.3     0.24    0.76    0.7     0.46      205
     Capriotti et al [13]   HybridMeth        -       -       -       -      0.46    21185
     Karchin et al [14]     SVM              0.2     0.19    0.81    0.8     0.61     3691
     Ng & Henikoff [19]     SIFT             0.19    0.31    0.69    0.81    0.50     5333
     Wang & Moult [20]      Stability        0.3     0.1     0.9     0.7     0.61      262
     Worth et al [16]       Combined         0.09    0.68    0.32    0.91    0.28     9143
     Yue & Moult [9]        SVM              0.15    0.26    0.74    0.85    0.59     6077
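
The rates in this table are related by FNR = 1 − TPR and TNR = 1 − FPR, and an MCC can be derived from them once the class proportions are fixed. The sketch below does this under the assumption of equally sized disease and neutral sets. That balance is an assumption of this example and is not stated in the table, although for the three rows used here the result agrees with the tabulated MCC to within rounding. The function name is invented.

```python
import math


def mcc_from_rates(tpr, tnr, pos_fraction=0.5):
    """MCC from TPR and TNR for a given fraction of positives in the data set."""
    tp = tpr * pos_fraction
    fn = (1 - tpr) * pos_fraction
    tn = tnr * (1 - pos_fraction)
    fp = (1 - tnr) * (1 - pos_fraction)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# (TPR, TNR) pairs copied from the table above.
rows = {
    "Bao et al": (0.76, 0.70),
    "Karchin et al": (0.81, 0.80),
    "Ng & Henikoff": (0.69, 0.81),
}
for study, (tpr, tnr) in rows.items():
    print(f"{study}: MCC = {mcc_from_rates(tpr, tnr):.2f}")
```

Changing `pos_fraction` shows how strongly the MCC depends on the composition of the data set, which is one reason why values obtained on different data sets are hard to compare.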


Supplementary Table S3 - Variation of the performance of SIFT on different data sets.




Study                Dataset             FPR    FNR    TPR    TNR    MCC
Bao et al [11]       Test set            0.33   0.38   0.62   0.67    0.29
Saunders et al [8]   Human                0.4   0.35   0.65    0.6    0.25
Ng & Henikoff [7]    lac I repressor     0.22   0.43   0.57   0.78    0.36
Ng & Henikoff [7]    HIV 1-protease       0.3   0.12   0.88    0.7    0.59
Ng & Henikoff [7]    T4 lysozyme         0.41   0.28   0.72   0.59    0.31
Ng & Henikoff [19]   SwissProt disease   0.19   0.31   0.69   0.81    0.50
Worth et al [16]     SwissProt + dbSNP   0.41   0.29   0.71   0.59    0.30
Our evaluation       SwissProt           0.79   0.31   0.69   0.21   -0.12




Figures
[Figure S1 panels, each plotting the number of SNPs (y-axis, 0 to 2 × 10^4): A. Structure type (NMR, X-RAY, High quality X-RAY); B. Sequence identity (0 to 100); C. Alignment coverage (0 to 100); D. Length of alignment (0 to 200).]

Figure S1. Structural coverage of Ensembl non synonymous SNP data. A. Number of SNPs in structures
determined by NMR and X-ray crystallography studies or models of these structures. 11% of all non
synonymous SNPs can be mapped on crystallography structures, and 7% of all SNPs can be modeled on a
high-quality X-ray structure (resolution ≤ 2.5Å). B. Number of SNPs covered by structural data versus
the sequence identity between the query sequence and the structural model. The number of SNPs that
can be modeled on X-ray structures (•) decreases from 15% of all nsSNPs (15685 nsSNPs, 5% sequence
identity) to 2.5% (3341) of all SNPs for which the structure of the wild type sequence has been
determined experimentally (100% sequence identity). When only high quality structures are considered
(◦), this amount is reduced by half to 7.4% for a sequence identity of 5% and 1.5% for exact models.
C. Number of SNPs covered by structural data versus the sequence coverage of the wild type sequence.
There are almost no SNPs for which the full length of the protein sequence is covered (100%
coverage), but for 80% coverage almost 8000 SNPs can be selected, of which circa 5500 in high quality
structures. D. Number of SNPs covered by structural data versus the length of the alignment between
protein sequence and structural model. About a third of the SNPs that can be modeled are located in a
structural alignment that is less than 100 amino acids long, both for models based on all X-ray
structures (•) and based on high resolution structures only (◦).




[Figure S2 panels, ROC curves with FPR on the x-axis and TPR on the y-axis (both 0 to 100): Overall energy contribution; Backbone H bond; Side chain H bond; Electrostatics; Entropy side chain; Entropy main chain.]

Figure S2. ROC curves for classification of disease mutations and neutral variation by using structural properties of the
amino acid substitution site.




[Figure S2 (continued) panels, ROC curves with FPR on the x-axis and TPR on the y-axis (both 0 to 100): Van der Waals contribution; Solvation hydrophobic; Main chain burial; Side chain burial; Solvation polar; Van der Waals clash.]

Figure S2 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
properties of the amino acid substitution site.




[Figure S2 (continued) panels, ROC curves with FPR on the x-axis and TPR on the y-axis (both 0 to 100): Entropy side chain by sampling strategy; Entropy side chain by sampling strategy (surface); Entropy side chain by sampling strategy (buried).]

Figure S2 (continued). ROC curves for classification of disease mutations and neutral variation by using structural
properties of the amino acid substitution site.




Figure S3. ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. Each panel plots TPR against FPR (both axes 0 to 100). Panels: Overall stability difference; Overall stability difference (surface residues); Overall stability difference (buried residues); Backbone clash; Backbone H bond; Sidechain H bond.
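
As an illustration of how a single-property ROC curve such as those in Figure S3 can be obtained, the following is a minimal sketch, not the authors' procedure. It assumes each variant carries one numeric score (for example a hypothetical stability difference) and that higher scores are taken to indicate disease; the function names `roc_points` and `auc` and the example values are illustrative only.

```python
from typing import List, Tuple


def roc_points(scores: List[float], is_disease: List[bool]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs in percent, one per distinct score threshold.

    Assumption: a variant is called 'disease' when its score is at or
    above the threshold, i.e. higher scores mean more damaging.
    """
    n_pos = sum(is_disease)
    n_neg = len(is_disease) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one disease and one neutral variant")

    # Rank variants by decreasing score and lower the threshold step by step.
    ranked = sorted(zip(scores, is_disease), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all variants tied at this score so ties give one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve, as a fraction between 0 and 1."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / 10000.0  # both axes are in percent


if __name__ == "__main__":
    # Hypothetical scores: three disease mutations, two neutral variants.
    example_scores = [3.0, 2.0, 2.0, 0.5, 0.1]
    example_labels = [True, True, False, False, True]
    curve = roc_points(example_scores, example_labels)
    print(curve)
    print(auc(curve))
```

A curve that follows the diagonal (area close to 0.5) means the property on its own does not separate the two sets, which is how the panels in these figures can be read.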







Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. Each panel plots TPR against FPR (both axes 0 to 100). Panels: Electrostatics; Entropy main chain; Entropy side chain; Solvation hydrophobic; Solvation polar; Torsion.




Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. Each panel plots TPR against FPR (both axes 0 to 100). Panels: Van der Waals contribution; Van der Waals clash; Entropy side chain by sampling strategy; Entropy side chain by sampling strategy (buried); Entropy side chain by sampling strategy (surface); Tango.




Figure S3 (continued). ROC curves for classification of disease mutations and neutral variation by using structural differences between the wild type and variant protein. Each panel plots TPR against FPR (both axes 0 to 100). Panels: Tango (positive scores); Tango (negative scores); Waltz; Waltz (positive scores); Waltz (negative scores); Limbo.




References                                                         16. Worth CL, Bickerton GRJ, Schreyer A, Forman JR,
 1. Chasman D, Adams RM: Predicting the func-                          Cheng TMK, Lee S, Gong S, Burke DF, Blundell TL:
    tional consequences of non-synonymous single nu-                   A structural bioinformatics approach to the anal-
    cleotide polymorphisms: Structure-based assess-                    ysis of nonsynonymous single nucleotide polymor-
    ment of amino acid variation. J Mol Biol 2001,                     phisms (nsSNPs) and their relation to disease. J
    307(2):683–706.                                                    Bioinform Comput Biol 2007, 5(6):1297–1318.
 2. Clifford RJ, Edmonson MN, Nguyen C, Buetow KH:                 17. Burke DF, Worth CL, Priego EM, Cheng T, Smink LJ,
    Large-scale analysis of non-synonymous coding                      Todd JA, Blundell TL: Genome bioinformatic anal-
    region single nucleotide polymorphisms. Bioinfor-                  ysis of nonsynonymous SNPs. BMC Bioinformatics
    matics 2004, 20(7):1006–1014.                                      2007, 8:301.
 3. Ferrer-Costa C, Orozco M, de la Cruz X: Sequence-              18. Worth CL, Burke DF, Blundell TL: Estimating the ef-
    based prediction of pathological mutations. Pro-                   fects of single nucleotide polymorphisms on pro-
    teins 2004, 57(4):811–819.                                         tein structure: how good are we at identifying
 4. Jiang R, Yang H, Sun F, Chen T: Searching for inter-               likely disease associated mutations? In Proceedings
    pretable rules for disease mutations: a simulated                  of Molecular Interactions - Bringing Chemistry to Life
    annealing bump hunting strategy. BMC Bioinfor-                     2006.
    matics 2006, 7:417.                                            19. Ng PC, Henikoff S: Accounting for human poly-
 5. Krishnan VG, Westhead DR: A comparative study of                   morphisms predicted to affect protein function.
    machine-learning methods to predict the effects of                 Genome Res 2002, 12(3):436–446.
    single nucleotide polymorphisms on protein func-               20. Wang Z, Moult J: SNPs, protein structure, and dis-
    tion. Bioinformatics 2003, 19(17):2199–2209.                       ease. Hum Mutat 2001, 17(4):263–270.
 6. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, West-            21. Halushka MK, Fan JB, Bentley K, Hsie L, Shen N,
    head DR: Predicting the effect of missense muta-                   Weder A, Cooper R, Lipshutz R, Chakravarti A: Pat-
    tions on protein function: analysis with Bayesian                  terns of single-nucleotide polymorphisms in can-
    networks. BMC Bioinformatics 2006, 7:405.                          didate genes for blood-pressure homeostasis. Nat
 7. Ng PC, Henikoff S: Predicting deleterious amino                    Genet 1999, 22(3):239–247.
    acid substitutions. Genome Res 2001, 11(5):863–874.            22. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K,
 8. Saunders CT, Baker D: Evaluation of structural and                 Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N,
    evolutionary contributions to deleterious muta-                    Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington
    tion prediction. J Mol Biol 2002, 322(4):891–901.                  J, Lipshutz R, Daley GQ, Lander ES: Characterization
                                                                       of single-nucleotide polymorphisms in coding re-
 9. Yue P, Li Z, Moult J: Loss of protein structure sta-               gions of human genes (vol 22, pg 231, 1999). Nat
    bility as a major causative factor in monogenic                    Genet 1999, 23(3):373–373.
    disease. J Mol Biol 2005, 353(2):459–473.
                                                                   23. Serrano L, Guerois R: Fold-X: An algorithm to pre-
10. Ferrer-Costa C, Orozco M, de la Cruz X: Characteriza-              dict and engineer folding pathways. Abstr Pap Am
    tion of disease-associated single amino acid poly-                 Chem Soc 2001, 221:U395–U395.
    morphisms in terms of sequence and structure
    properties. J Mol Biol 2002, 315(4):771–786.                   24. Guerois R, Nielsen JE, Serrano L: Predicting changes
                                                                       in the stability of proteins and protein complexes:
11. Bao L, Cui Y: Prediction of the phenotypic ef-
                                                                       A study of more than 1000 mutations. J Mol Biol
    fects of non-synonymous single nucleotide poly-
                                                                       2002, 320(2):369–387.
    morphisms using structural and evolutionary in-
    formation. Bioinformatics 2005, 21(10):2185–2190.              25. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F,
                                                                       Serrano L: The FoldX web server: an online force
12. Bao L, Cui Y: Functional impacts of non-
                                                                       field. Nucleic Acid Res 2005, 33:W382–W388.
    synonymous single nucleotide polymorphisms:
    Selective constraint and structural environments.              26. Schymkowitz JWH, Rousseau F, Martins IC, Ferkinghoff-
    FEBS Lett 2006, 580(5):1231–4.                                     Borg J, Stricher F, Serrano L: Prediction of water and
                                                                       metal binding sites and their affinities by using
13. Capriotti E, Calabrese R, Casadio R: Predicting the in-
                                                                       the Fold-X force field. Proc Natl Acad Sci USA 2005,
    surgence of human genetic diseases associated to
                                                                       102(29):10147–10152.
    single point protein mutations with support vec-
    tor machines and evolutionary information. Bioin-              27. Vriend G: What If - a molecular modeling and drug
    formatics 2006, 22(22):2729–2734.                                  design program. J Mol Graph 1990, 8:52–.
14. Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper              28. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J,
    U, Eswar N, Haussler D, Sali A: LS-SNP: large-                     Serrano L: Prediction of sequence-dependent and
    scale annotation of coding non-synonymous SNPs                     mutational effects on the aggregation of peptides
    based on multiple information sources. Bioinfor-                   and proteins. Nat Biotechnol 2004, 22(10):1302–1306.
    matics 2005, 21(12):2814–2820.                                 29. Maurer-Stroh S, Kuemmerer N, Lopez de la Paz M, Mar-
15. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif             tins I, Reumers J, Serrano L, Rousseau F, Schymkowitz
    S, Liang J: Structural location of disease-associated              J: Accurate prediction of sequence determinants
    single-nucleotide polymorphisms. J Mol Biol 2003,                  of amyloid formation using the Waltz algorithm.
    327(5):1021–1030.                                                  Submitted 2007.


                                                              13
30. Thompson MJ, Sievers SA, Karanicolas J, Ivanova MI,           a method that integrates homology modelling and
    Baker D, Eisenberg D: The 3D profile method for               experimental data. Submitted 2007.
    identifying fibril-forming segments of proteins.
                                                               32. McCarty JS, Rudiger S, Schonfeld HJ, Schneider-
    Proc Natl Acad Sci U S A 2006, 103(11):4074–4078.
                                                                   Mergener J, Nakahigashi K, Yura T, Bukau B: Regula-
                                                                   tory region C of the E. coli heat shock transcrip-
31. Van Durme J, Maurer-Stroh S, Wilkinson H, Rousseau             tion factor, sigma32, constitutes a DnaK binding
    F, Schymkowitz J: Accurate prediction of the se-               site and is conserved among eubacteria. J Mol Biol
    quence determinants of DnaK-peptide binding via                1996, 256(5):829–37.




                                                          14