-

Emidio Capriotti*§ and Russ B. Altman*‡

0 , Stanford University , Stanford (CA) , United States of America 1 Department of Mathematics and Computer Sciences, University of Balearic Islands , Palma de Mallorca , Spain 2 Departments of Bioengineering 3 and Genetics

Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. The non-synonymous SNPs occurring in coding regions resulting in single amino acid polymorphisms (SAPs) may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performances, the quality of the prediction can be further improved introducing new features derived from the protein three-dimensional structure. Results: In this paper, we present a structure-based machine learning approach to predict disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features from the protein sequence, structure and function information. After dataset balancing, the structure-based method reaches an overall accuracy of 84%, a correlation coefficient of 0.67, and an area under the receiving operating characteristic curve (AUC) of 0.91. When compared with a similar sequencebased predictor, structure-based method results in an increase of the overall accuracy and the AUC ~3%, and 0.06 for the correlation coefficient. Conclusion: This work demonstrates that structural information can increase the accuracy of detecting of disease-related SAPs. Our results also quantify the magnitude of the improvement on a large data. This improvement is in agreement with the previously observed results in the prediction of the protein stability change upon mutation.

protein mutation [4-16]. These algorithms are able to predict the protein stability change [10, 11, 16], the variation in protein functional activity [6] and the insurgence of human pathologies [4, 5, 7-9, 12-15]. The majority of the methods rely on information derived from protein sequence [4, 8, 9, 14], others use protein structure data [12, 17] and knowledge-based information [7, 13, 15]. In this paper we focus our attention on SAPs presenting a new machine learning based method to predict disease-related SAPs using together protein sequence, structural and functional information. We quantified the improvement of the performance resulting from the use of protein structure information.

Results Performance of the method

In the last decades machine learning approaches have been successfully used to address several biological problems and develop new prediction methods. We modified a previously developed predictor introducing new three-dimensional structure information. In particular we use new features to describe the structural environment of the mutation considering a radius shell of 6 Å around the C-α. To quantify the improvement of the accuracy resulting from the use of 3D structure information, we compare the performances of a structure-based method (SVM-3D) with a sequencebased one (SVM-SEQ). In Tab. 1 different accuracy measures for both predictors are reported. The structure-based method results in 3% better overall accuracy and 0.06 better correlation. Comparing the ROC curves (Fig. 1 A), SVM-3D results in 0.02 better Area Under the Curve (AUC) with respect to SVM-SEQ. If 10% of wrong predictions are accepted SVM-3D has 6% more true positive. The output returned by the SVM has been used to calculate the Reliability Index (RI) and filter prediction. If predictions with RI>5 are selected the SVM-3D method results in 90% overall accuracy, 0.81 correlation coefficient on 74% of the whole dataset (see Fig 1 B). Analyzing the predictions of SVMSEQ and SVM-3D methods we found that outputs agree in the 88% of the cases. On this subset the overall accuracy is 86% and the correlation coefficient of the method is 0.73. For the remaining 12% of the predictions, SVM-SEQ method results in a very poor overall accuracy and correlation respectively 37% and -0.25. SVM-3D performs slightly better than a random predictor resulting in 63% overall accuracy and a 0.25 correlation (see Tab 2).

Structure environment analysis

Protein three-dimensional structural information is an important feature to predict the effect of SAPs. The analysis of the protein structure provides information about the environment of the mutation. In fact, the effect of the mutation depends on the position of the mutated residue, if it is buried in the hydrophobic core or exposed on the surface of the protein. In Fig. 2 panel A the distributions of the relative solvent accessible area (RSA) for disease-related and neutral variants are plotted. The two distributions have mean RSA values of 20.6 and 35.7 respectively for disease-related and neutral variants (see Fig 2 panel A). They are significantly different and the Kolmogorov-Smirnov test returns a p-value of 2.8*10-71. We calculated the overall accuracy and correlation coefficient of our method dividing the dataset in 10 bins according to RSA value of the mutated residue. The SVM-3D method shows better performance in the prediction of buried (RSA<20) and highly exposed (RSA>80) residues (see Fig 2 panel B).

Scoring the residue interactions

The protein three-dimensional structure information is important to calculate the interactions between residues far in the sequence but close in the 3D space. We defined two types of interactions: the lost interactions are those missing after the wild-type mutation and the new interactions formed by the mutant residue. In this section we compared the frequency of lost and new interactions related to disease or neutral mutations. We calculated the log odd score for lost and new interactions respectively in panels A and B (see Fig. 3). According to these results, the most deleterious lost contacts are between and Cys-Cys and newly formed interactions between Trp-Trp are the most damaging ones. The missing Cys-Cys interactions could lead to the loss of a disulphide bond and the mutation of a residue into a Tryptophan when close to another Tryptophan could result in stereo-chemical problems.

An example of missing Cys-Cys interaction has been observed in the mutation of Cys163 in the Glycosylasparaginase (Swiss-Prot:ASPG_HUMAN). This mutation is responsible for the insurgence of the Aspartylglucosaminuria (MIM:208400). Looking at the protein structure (Fig 4), we found that the mutation of the Cys163 to Serine results in the loss of the disulfide bridge between Cys163 and Cys179 (respectively Cys140 and Cys156 in the PDB structure 1APY chain A). Interesting example of possible damaging newly formed interaction can be observed in the Thyroid hormone receptor (SwissProt:THB_HUMAN) where the mutation of Arg243 into Tryptophan is cause of the Thyroid hormone resistance (MIM:188570,274300). Analyzing the protein structure (1 !" chain A) we found that the new Tryptophan could be close to another one in position 239. This mutation could result in stereo-chemical problems in the pocket around the position 243 (see Fig 5). Both the examples are correctly predicted by structure-based method and wrongly predicted by the sequence-based algorithm.

Conclusion

We developed a new machine learning approach based on protein structure information to predict the effect of SAPs. The method has been compared to a previously developed sequence-based predictor to quantify the increase of accuracy achieved by protein structure information. Using a balanced set of 6,630 mutations the structure-based method results in about 3% higher accuracy and AUC and 0.06 higher correlation with respect to sequence-based one. Although the increase the accuracy is not extremely high the introduction of structure information can be particularly useful in specific situation providing insight about the disease mechanism like in the cases discussed above. The prediction improvement is in agreement with the previously results observed in the prediction of the protein stability change upon mutation [10].

Methods Datasets

The preformaces of machine learning methods strongly depend from the training set. This is the reason why the selection of a representative set of SAPs is a pivotal issue in the development of predictive algorithms. A previous analysis of different SAPs databases has shown that annotated set of variants from Swiss-Var database is the best available one [18]. According to this observation, we selected our set of SAP from SwissVar release 57.9 (Oct 2009) and we map all the variants on the protein structures available in the Protein Data Bank (PDB) [19]. To reduce the number of sequence alignments between Swiss-Prot sequences and sequences derived from the PDB, we use a precompiled list of correspondences between Swiss-Prot and PDB codes available at the ExPASY web site. Using this list we aligned each pair of sequences using Blast algorithm [20] and filtering out alignment with: i) gaps, ii) sequence identity lower than 100% and iii) shorter than 40 residues. The remaining alignments are used to calculate the correspondence between the Swiss-Prot and PDB residue numerations. In case a mutation maps in more than one protein structure, the one with best resolution has been selected. After this filtering procedure we obtain a set of 4,986 mutations from 784 protein chains. The dataset of variants mapped into protein structures is composed by 3,342 disease-related SAPs and 1,644 neutral polymorphisms. To keep the dataset balanced we doubled the number of neutral variants considering their reverse mutation as neutral. The final set results in 6,630 mutations about equally distributed between disease-related and neutral SAPs.

Implemented SVM-based predictors

The proposed task is to predict whether a given single amino acid polymorphism is a neutral or disease-related. The task is treated as a binary classification problem for the protein upon mutation. The Support Vector Machine (SVM) input features for the structural-based predictor include: the amino acid mutation, the mutation structural environment, the sequence-profile derived features, and a functional-based log-odds score calculated considering the GO classification. The final input vector consists 48 elements: • 20 components encoding for the mutations (Mut) • 21 local protein structure information (3D) • 5 inputs features derived from sequence profile (Prof) • 2 elements encoding for the number of GO term associated to the protein and the

GO log-odd score (LGO).

A similar sequence-based SVM predictor has been used to measure the increase of accuracy resulting from the use of protein three-dimensional structure information. The structure-based SVM differs only in the 21 elements vector encoding for the local protein structure environment (3D) that replaces the 20 elements vector encoding for the sequence environment. More details about the SVM input features have been described in supplementary materials.

Interaction score

The residues interactions are defined considering all the residues within a radius shell of 6 Å around the C-α of the mutated residue. According to this we calculate a log odd score dividing the frequency of lost interactions related to disease by the same type of interactions that have no pathological effect.

Although the mutations could be responsible for protein structural changes, as first approximation, we consider the position of the C-α of the new residue will not change significantly after the mutation. Hence, we consider new interactions those between the mutant residue and the residues previously interacting with the wild-type. A score of the possible damaging effect of lost or new interactions are calculated as follow LCk=log2[f(ck(i,j),D)/f(ck(i,j),N)] [1] where fk(ck(i,j),D) and f(ck(i,j),N) are the frequencies of contacts between residues i and j respectively for disease-related (D) and neutral (N) variants and k is equal to l or n respectively for lost and new interactions.

Accuracy measures

The performances of our methods are evaluated using a 20-fold cross-validation procedure on the whole SAPs dataset. The dataset has been divided keeping the ratio of the disease-related to the neutral polymorphism mutations similar to the original distribution of the whole set. Furthermore, all the proteins in the datasets are clustered according to their sequence similarity with the blastclust program in the BLAST suite [20] by adopting the default value of length coverage equal to 0.9 and the percentage similarity threshold equal to 30%. We kept all the mutations belonging to a protein in the same training set to overestimate the performance. Classical accuracies measures such as the overall accuracy (Q2), the sensitivity (S), the probability of correct predictions (P), the Matthewʼs correlation coefficient (C), the false and true positive rates (FPR, TPR) and the area under the ROC curve (AUC) are used to score the performance of our predictors. A Reliability Index (RI) score has been calculated to select more reliable predictions. More details about the definition of the statistical index used in this work are provided in the supplementary materials.

Acknowledgments

EC acknowledges support from the Marie Curie International Outgoing Fellowship program (PIOF-GA-2009-237225). RBA would like to acknowledge the following funding sources: NIH LM05652 and GM61374. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

Capriotti E, Calabrese R, Casadio R: Predicting the insurgence of human genetic

diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 2006, 22(22):2729-2734.

Capriotti E, Fariselli P, Casadio R: I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 2005, 33(Web

Server issue):W306-310.

Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and

protein complexes: a study of more than 1000 mutations. J Mol Biol 2002, 320(2):369-387.

Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A: LSSNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 2005, 21(12):2814-2820.

Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P: Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 2009, 25(21):2744-2750.

Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res

2001, 11(5):863-874.

Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey.

Nucleic Acids Res 2002, 30(17):3894-3900.

Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9 Suppl 2:S6. Wong WS, Yang Z, Goldman N, Nielsen R: Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 2004, 168(2):1041-1051. Care MA, Needham CJ, Bulpitt AJ, Westhead DR: Deleterious SNP prediction: be mindful of your training data! Bioinformatics 2007, 23(6):664-672.

Berman H, Henrick K, Nakamura H, Markley JL: The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res 2007, 35(Database issue):D301-303.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped

BLAST and PSI-BLAST: a new generation of protein database search programs.

Nucleic Acids Res 1997 , 25(17):3389-3402.

Fig. 3 Log odd score for lost residues interactions (A) and for newly formed interactions (B). The red zones correspond to damaging lost or new interactions. Bleu points correspond to neutral interactions.

SVM-SEQ SVM-3D The accuracy measures are defined in supplementary materials. D, N stands for disease-related and neutral variants respectively.

Q2 P[D] S[D] P[N] S[D] C AUC PM SEQ∩3D 0.86 0.85 0.89 0.88 0.84 0.73 0.92 88 SEQ-3D 0.63 0.66 0.65 0.60 0.60 0.25 0.68 12 3D-SEQ 0.37 0.40 0.35 0.34 0.40 -0.25 0.40 12 SEQ∩3D indicates the subset of agree predictions, SEQ-3D and 3D-SEQ are respectively the predictions of SVM-SEQ and SVM-3D on the not agree prediction subset. The accuracy measures are defined in supplementary materials. PM is the fraction of the dataset. D, N stands for disease-related and neutral variants respectively.

Supplementary Material Improving the prediction of disease-related variants using protein three-dimensional structure.

Departments of Bioengineering* and Genetics‡, Stanford University, Stanford (CA), United States of America; §Department of Mathematics and Computer Sciences, University of Balearic Islands,

Palma de Mallorca, Spain.

{emidio, russ.altman}@stanford.edu

Support Vector Machine (SVM) input features

The SVM-based methods developed in this work consider in input the following features: i) residue mutation; ii) protein sequence profile; iii) functional score based on Gene Ontology (GO) terms and iv) either sequence or structure mutation environment.

Encoding residue mutation

The input vector relative to mutation consists of 20 values: the first 20 (the 20 residue types) explicitly define the mutation by setting to -1 the element corresponding to the wild type residue and to 1 the newly introduced residue (all the remaining elements are kept equal to 0).

Encoding mutation structure environment

The protein structural environment is encoding with a 21 elements vector. The first 20 elements encode for the number of each residue type, which have at least one heavy atom within a radius shell around the C-α of the mutated residue. After an optimization procedure a shell of 6 Å radius has been considered. The 21st element is the relative solvent accessible area calculated using the DSSP program [1].

Encoding mutation sequence environment

The 20 element input values for the mutation sequence environment (the 20 elements represent the 20 residue types) encode for the number of the each residue type, to be found inside a window centered at the residue that undergoes mutation and that symmetrically spans the sequence to the left (N-terminus) and to the right (C-terminus) with a length of 19 residues [2].

Encoding sequence profile information

We derive for each mutation: the frequency of the wild type, the frequency of the mutated residue, the number of totally and locally aligned sequences and a conservation index (CI) for the position at hand: the more a residue is functionally important the more is conserved over evolution [3]. The conservation index is calculated as: where fa(i) is the relative frequency of residue a at mutated position i and fa is the overall frequency of the same residue in the alignment. The sequence profile is computed from the output of the BLAST program [4] running on the uniref90 database (Oct 2009) (Evalue threshold=10-9, number of runs=1).

Functional based score

The Gene Ontology log-odds score (LGO) provides information about the correlation among a given mutation type (disease related and neutral) and the protein function. The annotation data are relative to the GO Database (version Mar 2010) and are retrieved at the web resource hosted at European Bionformatics Institute (EBI). To calculate the LGO, first we derived the GO terms from all the three branches (molecular function, biological process and cellular components) for all our proteins in the dataset. For each annotated term the appropriate ontology tree was traversed upward to retrieve all the parent terms with the GO-TermFinder tool (http://search.cpan.org/dist/GO-TermFinder/) [5] and counting a GO term only once. The log-odds score associated to each protein is calculated as:

LGO=Σ log2[fGO(D)/fGO(N)] where fGO is the frequency of occurrence of a given GO term for the disease-related (D) and neutral mutations (N) adding one pseudo-count to each class. To prevent the overfitting, the LGO scores are evaluated considering fGO values computed over the training sets without including in the GO term counts of the corresponding test set.

Support Vector Machine software

The LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) has been used for the SVM implementation [6]. The selected SVM kernel is a Radial Basis Function (RBF) kernel K(xi,xj)=exp(-γ||xi-xj||2) and γ and C parameters are optimized performing a grid like search. After input rescaling the values of the best parameters are C=8 and γ=0.03125

CI(i)=[Σa=120(fa(i)-fa)2]1/2

Statistical indexes for accuracy measure

The prediction accuracy is scored with several measures. In this paper the efficiency of our predictors have been scored using the following statistical indexes.

The overall accuracy is: where P is the total number of correctly predicted mutations and N is the total number of mutations. The Matthewʼs correlation coefficient C is defined as:

C(s)=[p(s)n(s)-u(s)o(s)] / D where D is the normalization factor: P(s)=p(s) / [p(s) + o(s)] for each class s (D and N, stand for disease-related and neutral mutations respectively); p(s) and n(s) are the total number of correct predictions and correctly rejected assignments, respectively, and u(s) and o(s) are the numbers of false negative and false positive for the class s.

The coverage S (sensitivity) for each discriminated class s is evaluated as: where p(s) and u(s) are the same as in Equation 5.

The probability of correct predictions P (or positive predictive values) is computed as: where p(s) and o(s) are the same as in Equation 5 (ranging from 0 to 1). For each prediction a reliability score (RI) is calculated as follows: where O(D) is the SVM output. Other standard scoring measures, such as the area under the ROC curve (AUC) and the true positive rate (TPR= Q(s)) at 10% of False Positive Rate (FPR= 1-P(s)) are also computed [7].

Kabsch

, Sander

: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features . Biopolymers 1983 , 22 ( 12 ): 2577 - 2637 .

Capriotti

, Calabrese

, Casadio

: Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information . Bioinformatics 2006 , 22 ( 22 ): 2729 - 2734 .

Pei

, Grishin

: AL2CO: calculation of positional conservation in a protein sequence alignment . Bioinformatics 2001 , 17 ( 8 ): 700 - 712 .

Nucleic Acids Res 1997 , 25 ( 17 ): 3389 - 3402 .

Bioinformatics 2004 , 20 ( 18 ): 3710 - 3715 .

Neural Comput 2001 , 13 ( 9 ): 2119 - 2147 .

Baldi

, Brunak

, Chauvin

, Andersen

, Nielsen

: Assessing the accuracy of prediction algorithms for classification: an overview . Bioinformatics 2000 , 16 ( 5 ): 412 - 424 .