Mutation tagging with gene identifiers applied to membrane protein stability prediction

Rainer Winnenburg, Conrad Plake, and Michael Schroeder
Biotec, TU Dresden, Germany
ms@biotec.tu-dresden.de

Abstract

The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets.

As a prerequisite, we developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the pure mutation retrieval task on a benchmark dataset.

In order to link mutations to their proteins, we utilised a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and established links based on sequence checks. We identified more than 10 million genes/proteins in nearly 3.5 million abstracts and 260,000 mutations in 80,000 of these abstracts (2.3%). In 52% of cases the identified gene's sequence and the mutation are consistent. We evaluated the use of mutations in gene identification in detail on a small test set of 22 abstracts. Identifying the correct gene improved from 77% to 91% when considering the mutations.

To demonstrate practical relevance, we set up a mutation screening for five membrane proteins from the family of G protein-coupled receptors to evaluate a solvation energy based model for the prediction of stabilising regions in membrane proteins. We identified 35 mutations in text. 25 out of 35 mutation phenotypes reported in literature were in compliance with the prediction of the energy model, which supports a relation between mutations and stability issues in membrane proteins.

1 Introduction

Proteins carry out most cellular functions: they act as building blocks for structures, as enzymes and as gene regulators, and are involved in cell mobility and communication (Alberts et al., 2002). Proteins may interact briefly with each other in an enzymatic reaction, or for a long time to form part of a protein complex. The interactions between proteins are of central importance for almost all processes in living cells, and are described by numerous distinct pathways in databases such as KEGG (Ogata et al., 1999). Malfunctions or alterations in such pathways can be the cause of many diseases, for instance when the biosynthesis of involved proteins is repressed or proteins are not interacting the way they should. The latter can be due to structural changes in one of the interacting proteins, caused by point mutations, i.e. single wild type amino acid substitutions. Indeed, it is already well known that such mutations are the cause of many hereditary diseases. Thus the large-scale analysis of point mutation data in combination with information about protein interactions, protein structure and disease pathogenesis might facilitate the study of still unresolved phenotypes and diseases.

It is envisaged to provide an automated system for the interpretation of structure-function relations in the context of genetic variability data. Despite the availability of numerous biomedical data collections, valuable information about mutation-phenotype associations is still hidden in non-structured text in the biomedical literature. Thus text mining methods are implemented to automatically retrieve these data from the 18 million literature references in PubMed. The extracted knowledge will be stored in one homogeneous data store and integrated with already available data from suitable databases. On the basis of all these combined data, new hypotheses can be formulated, like the prediction of phenotypic effects induced by mutations. At the moment, we are populating a database with organism specific protein-mutation associations which we envisage to apply to diverse biological problems, such as the detection of mutation centred gene-disease associations in human.

2 Background

Genomic variation data has already been collected for many years. Single nucleotide polymorphisms (SNPs), which make up about 90% of all human genetic variation and occur every 100 to 300 bases along the 3-billion-base human genome, are available as large collections. Single amino acid polymorphisms (SAPs), originating from wet lab experiments, are often manually extracted from literature and curated into databases. Additionally, some structures of such mutations may be revealed in crystallography experiments and might eventually end up as distinct structures in the Protein Data Bank (PDB). Of particular interest is the identification of mutations which have a strong influence on the stability of proteins. Therefore, the biomedical literature can be systematically searched for information about mutation-phenotype associations by text mining, which may lead to new insights beyond information in existing databases. For the text mined data it is additionally possible to weight or prioritise information according to its publication date, the involved authors and the journal. Considering these meta data can be relevant if for instance an already published assumption has been proven wrong in a more recent publication, or for determining whether a protein is a hot topic or whether the information has already been available for years. Furthermore, it is possible to obtain a more detailed view on a protein's characteristics, e.g. if a certain interaction only takes place under specific conditions, or if an interaction is prevented by the conformational change of a protein domain triggered by a point mutation.

2.1 Databases

Data on mutations have been collected for years, for numerous species and by different organisations for diverse purposes. There are many efforts to cope with the data, which is being made available in a growing number of databases. The Human Genome Variation Society (Horaitis and Cotton, 2004) promotes the collection, documentation and free distribution of genomic variation information. New mutation databases are reported in the journal Human Mutation on a regular basis. There are manually curated databases like OMIM (Hamosh et al., 2002) and the UniProt Knowledgebase (Yip et al., 2008; Yip et al., 2007), and general central repositories like the Human Gene Mutation Database (Stenson et al., 2008), the Universal Mutation Database (Béroud et al., 2000), the Human Genome Variation Database (Fredman et al., 2004), and MutDB (Singh et al., 2007).

Besides these central repositories, there are small specialised databases, such as the infevers autoinflammatory mutation online registry (Milhavet et al., 2008), the GPCR NaVa database for natural variants in human G protein-coupled receptors (Kazius et al., 2007), or the Pompe disease mutation database with 107 sequence variants (Kroos et al., 2008).

In contrast, unpublished SNPs normally make their way into large locus specific data repositories. Since August 2006, there has also been SNPedia, a wiki based approach that, in contrast to classical databases, collects information on variations in human DNA.

2.2 Text mining

Despite the availability of numerous biomedical data collections, valuable information about mutation-phenotype associations is still hidden in non-structured text in the biomedical literature. Thus text mining methods are implemented to automatically retrieve these data from the 18 million referenced articles in PubMed. Text mining aims to automatically extract and combine information spread across several natural language texts, thereby generating new hypotheses. One of the key prerequisites for finding new facts (e.g. interactions or mutations) is named entity recognition (NER) in text: the assignment of a class to an entity (e.g. protein), as well as of a preferred term or identifier in case an entry exists in a database, such as UniProt, or in a controlled vocabulary like the Gene Ontology (GO) (Ashburner et al., 2000). For the task of named entity recognition usually a dictionary is used, which contains a list of all known entity names of a class (e.g. human proteins) including synonyms. For the recognition of patterns (e.g. database identifiers like NM_12345) regular expressions can be defined. For the analysis of whole sentences, natural language processing (NLP) techniques are used, which aim to understand text on a syntactic and semantic level. This approach is often paired with systems which are based on a set of manually defined rules or which make use of (semi-)supervised machine learning algorithms.

There have already been diverse examples of the successful application of text mining to the mutation retrieval task. Early examples are the automatic extraction of mutations from Medline and cross-validation with OMIM (Rebholz-Schuhmann et al., 2004), and the work by Cantor and Lussier (2004), who mined OMIM for phenotypic and genetic information to gain insights into complex diseases. More recently, Caporaso et al. (2007b) applied their concept recognition system based on regular expressions to the mutation mining task, and the automatic extraction of protein point mutations using a graph bigram association (Lee et al., 2007) was reported to reliably find gene-mutation associations in full text. For identifying gene-specific variations in biomedical text, Klinger et al. (2007) integrate the ProMiner system, developed for the recognition and normalisation of gene and protein names, with a conditional random field (CRF)-based recognition system. As an answer to the diverse approaches developed over the past years, a framework for the systematic analysis of mutation extraction systems was proposed by Witte and Baker (2007).

More and more groups are working on mutations in proteins and their involvement in diseases. Kanagasabai et al. (2007) developed mSTRAP (Mutation extraction and STRucture Annotation Pipeline) for mining mutation annotations from full-text biomedical literature, which they subsequently used for protein structure annotation and visualisation. Worth et al. (2007) use structure prediction to analyse the effects of nonsynonymous single nucleotide polymorphisms (nsSNPs) with regard to diseases. Focussing on Alzheimer's disease, Erdogmus and Sezerman (2007) extract mutation-gene pairs, with estimated 91.3%, and precision at 88.9%. Lage et al. (2007) realised a human phenome-interactome network of protein complexes implicated in genetic disorders by integrating quality-controlled interactions of human proteins with a validated, computationally derived phenotype similarity score.

3 Methods

Through the combination of different data from literature and databases it is possible to derive new facts, e.g. novel gene-disease associations or the influence of mutations on protein-protein interactions. The approach is designed in such a way that it can in principle be applied to any kind of genetic data for answering disease centred questions. For the moment, we concentrate on collecting available high quality data on protein point mutations from curated databases and from peer-reviewed literature. For the latter we present a flexible approach for both the specific and the high-throughput retrieval of mutations. In detail, the following tasks have to be performed: (1) Identify genes/proteins in abstracts. (2) From this subset consider only those abstracts which additionally contain information about mutations. (3) Propose potential protein-mutation pairs. (4) Filter proposed pairs by sequence compliance. (5) Utilise this information for the refinement of the original gene/protein identifier.

3.1 Entity recognition

Gene normalisation. This module allows for the automated named entity recognition of genes and proteins. Our approach performs gene name disambiguation by using background knowledge to match a gene with its context against the text as a whole (Hakenberg et al., 2007). A gene's context contains information on Gene Ontology annotations, functions, tissues, diseases etc. extracted from the databases Entrez Gene and UniProt. A comparison of gene contexts against the text gives a ranking of candidate identifiers, and the top ranked identifier is taken if it scores above a defined threshold. This approach has recently been extended for inter-species normalisation and achieves an 81% success rate on a mixed dataset of 13 species (Hakenberg et al., 2008).

Mutation tagging. We implemented an entity recognition algorithm (MutationTagger) to automatically extract protein point mutation mentions from PubMed abstracts. Wild-type and mutant amino acid, as well as the sequence position of the substitution, are extracted by means of both a set of regular expressions for pattern recognition of 1- or 3-letter notations (e.g. E312A or Glu(312)→Ala), and rules for the more complex identification of textual mutation descriptions (e.g. Glu312 was replaced with alanine). Problems concerning the full text representations (detecting the correct sequence position of the mutated residue and unravelling enumerations) have been addressed by additional extraction algorithms and the implementation of a sequence check. An evaluation of our method on the test data from MutationFinder (Caporaso et al., 2007a) showed comparable success rates of around 89% F-measure for mutation mention extraction.

3.2 Association of entity pairs

In the process of recognising mutations in text, the normalisation, i.e. the direct association to specific proteins, remains a challenge. This is due to the fact that the abstracts of relevant publications typically mention more than one mutation and protein. Thus, a mutation-protein association purely based on their co-occurrence in one abstract is not sufficient, as pairing every mutation with every protein would result in a huge number of false positive predictions. The problem becomes even more evident when considering that both gene and mutation tagging are imperfect, achieving a precision of 80 to 90% each.

A method is desired that both disambiguates the relations of candidate mutations and proteins, and filters out false positives from the underlying individual mutation and protein recognition tasks. There are approaches which apply a word distance metric for assigning a mutation to its nearest occurring protein term. This is error prone, as matching mutation and protein do not necessarily have to occur close to each other in the abstract or even in the same sentence. The statistical approach GraB is an excellent tool for the automatic extraction of protein point mutations using a graph bigram association (Lee et al., 2007). It achieves good results for the most likely mutation-protein association, but alone would also not fulfil the second aspect of filtering out false positives.

Sequence checks. Mutations are commonly described as the substitution of a wild-type by a mutant amino acid at a given position. Our method compares the wild-type residue as described in a mutation mention with the UniProt/Swiss-Prot and PDB protein sequences for all candidate proteins. It is important to incorporate sequences from both repositories, as the sequence numbering can differ and it is not always evident from a publication's abstract which numbering the mutation notation refers to. To map UniProt IDs to PDB and vice versa, we used PDB cross-references in UniProtKB/Swiss-Prot from http://beta.uniprot.org/docs/pdbtosp and the residue specific comparison between PDB and Swiss-Prot sequences as provided by http://www.bioinf.org.uk/pdbsws/ (Martin, 2005). Only associations between mutations and proteins with matching sequences are considered.

3.3 Annotation pipelines

The developed mutation retrieval pipeline can be accessed through two different interfaces (see Figure 1), which offer, depending on the annotation task, either a systematic or a quick and flexible solution. The following approaches have been implemented:

• Organism-centred approach (database)
All available mutations for a given organism are retrieved in one single literature screening and stored in the mutation database. This approach relies on the large-scale identification of gene mentions in PubMed abstracts, which have to be compiled for organisms of interest prior to a mutation screening. As of now, gene mention data is available for human, mouse, and yeast. Data for additional relevant organisms will be added on a regular basis in the near future.

• Protein-centred approach (on-the-fly)
It is possible to retrieve relevant data for a single gene or a list of genes/proteins for any organism. For this purpose, the gene identification part performed by the gene normaliser is replaced by a direct full text search in the PubMed library using the Entrez Programming Utilities. Again, the result is a set of abstracts, which is subsequently processed by the MutationTagger.

Figure 1: Workflow of mutation data retrieval with MutationTagger. A: Abstracts mentioning proteins for given species are tagged for mutations. The filtered data is written to the database. B: For a protein of interest, relevant articles are retrieved and tagged for mutations. The filtered data can be exported to HTML or SQL.

3.4 Improvement of gene normalisation

As described above, we defined the input set of documents for the organism-centred mutation mining approach by scanning the whole PubMed database for abstracts mentioning at least one gene or protein of a pre-defined species. For this filtering step, we relied on the gene normalisation techniques of our gene normaliser, which was applied to all PubMed abstracts in advance and has shown 85% F-measure for human genes and slightly lower values for other species. However, the gene normalisation proposes by default only one single identifier per gene mention, even if a set of different candidate identifiers was computed. According to internal ranking mechanisms, only the top scoring candidate is considered. This leads to a possible scenario in which the correct identifier is ranked lower and would be neglected in any subsequent data processing. In the case of our mutation mining algorithm, we assume that some mutations cannot be associated to the correct protein because the gene tagging task already failed.

On the other hand, it should be possible to improve the performance of both entity recognition techniques for genes and mutations by combining the results. The idea is to run both approaches with low precision, thus receiving a high recall, permute all elements of both sets, and then consider the intersection of all combinations that fit. Mutation and gene product are considered to be a valid pair if the wild-type residues at the mutated position in the protein sequence and in the reported mutation match (as described in section 3.1). For all proposed gene identifiers, protein sequences are obtained and checked for compliance with the reported wild type amino acid.
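The pairing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MutationTagger implementation: the two regular expressions cover only the plain 1-letter and 3-letter notations (E312A, Glu(312)→Ala), the function names `extract_mutations` and `supported_pairs` are hypothetical, and the candidate sequences are made-up toy strings rather than UniProt or PDB entries.

```python
import re
from itertools import product

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Assumed, simplified patterns: 1-letter (E312A) and 3-letter (Glu(312)→Ala).
ONE_LETTER = re.compile(rf"\b([{AA1}])(\d+)([{AA1}])\b")
THREE_LETTER = re.compile(
    r"\b([A-Z][a-z]{2})\(?(\d+)\)?\s*(?:→|->)\s*([A-Z][a-z]{2})\b"
)


def extract_mutations(text):
    """Return a set of (wild_type, position, mutant) tuples found in text."""
    found = set()
    for wt, pos, mut in ONE_LETTER.findall(text):
        found.add((wt, int(pos), mut))
    for wt3, pos, mut3 in THREE_LETTER.findall(text):
        wt, mut = THREE_TO_ONE.get(wt3), THREE_TO_ONE.get(mut3)
        if wt and mut:  # skip three-letter words that are not amino acids
            found.add((wt, int(pos), mut))
    return found


def supported_pairs(mutations, candidates):
    """Pair every mutation with every candidate, keep sequence-supported ones.

    candidates maps a (hypothetical) gene identifier to its protein sequence;
    positions are 1-based, as in the usual mutation notation.
    """
    pairs = []
    for (wt, pos, mut), (gene_id, seq) in product(
        sorted(mutations), sorted(candidates.items())
    ):
        if 1 <= pos <= len(seq) and seq[pos - 1] == wt:
            pairs.append((gene_id, f"{wt}{pos}{mut}"))
    return pairs


# Toy example: made-up abstract text and made-up candidate sequences.
example_text = "The W5G mutant and Glu(3)→Ala were analysed."
example_candidates = {"geneA": "MKEAW", "geneB": "MKDAG"}
pairs = supported_pairs(extract_mutations(example_text), example_candidates)
print(pairs)  # [('geneA', 'E3A'), ('geneA', 'W5G')]
```

With these toy inputs only geneA is paired with both mutations, because geneB carries different residues at positions 3 and 5. Loose patterns of this kind also match strings that are not mutations (a histone name such as H2A, for instance), which is one reason a sequence filter is useful after tagging.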
The score of identifiers that show a match is increased, which might lead to a re-ranking of the identifiers for one gene entity. This could further improve the original gene normalisation approach for candidate entities which are reported to show a mutation.

Example. As shown in Figure 2, our gene normaliser identified CCP (human crystallin, gamma D; EntrezGene ID 1421) as the top candidate gene name for abstract PMID 8142383. The mutation tagger identified a replacement of tryptophan with glycine at position 191 as the only mutation mentioned in the paper. None of the protein sequences retrieved for human CCP showed a tryptophan residue at position 191, which means that this gene identifier was not supported by mutation information. However, besides human crystallin, cytochrome-c peroxidase in yeast (EntrezGene ID 853940) was also proposed as an alternative identifier, which received a lower score. As the product of this gene showed a tryptophan residue at position 191 (according to the PDB sequence), its score was increased, making it the new top candidate. Indeed, manual curation of the corresponding literature confirmed that the only gene mentioned in the abstract is cytochrome-c peroxidase in yeast. The same positive re-ranking, finding the correct gene identifier through mutation information, was shown for human TP53 in paper 11254385, and human amylase alpha in paper 15182367.

Figure 2: Example for gene name normalisation with the help of mutation mining. Initially, our gene normaliser proposed the human gene CCP as its context fits the text best (abstract not fully shown). However, when comparing the recognised mutation at position 191 with the sequences of all three candidates, only CCP in yeast contains the wild-type tryptophan at the specified position (PDB entry). After checking the full text of this publication, we found that CCP indeed refers to the gene in Saccharomyces cerevisiae.

4 Results

Mutation database. In order to establish a mutation database, which will eventually store all protein point mutations mentioned in PubMed abstracts for all organisms of interest, a first platform has been realised, comprising a MySQL database which can be accessed through a web interface.

To populate the database, in a first step the PubMed corpus is filtered for abstracts mentioning at least one gene or protein using the named entity recognition algorithm as described in Section 3.1, which is currently working for the three organisms human, mouse, and yeast. This led to a set of 3,443,566 abstracts proposing more than 10 million potential protein candidates. In a second step, the mutation extraction algorithm is applied to this corpus and the retrieved information is transferred into the database. In total, 258,511 mutations were found in 78,968 abstracts. Subsequently, for all candidate genes found in these abstracts, the corresponding sequences are obtained and checked for compliance with the wild type amino acid at the position of the mentioned mutation, which led to 877,183 potential protein-mutation pairs. Out of the mutations found, 127,384 are supported by sequence (74,722 if multiple mentions of the same mutation in one abstract are counted as one), in contrast to 131,127 (77,643) mutations which have not passed the sequence filter. In summary, of all mutations identified by the plain algorithm, about 49% could be supported by gene associations based on the sequence check. These data were retrieved from 41,384 (52%) abstracts in total.

Evaluation. We evaluated our approach on two different tasks: pure identification of a mutation in a text, and the identification of correct mutation-protein pairs. An evaluation of our method on the test data from MutationFinder (Caporaso et al., 2007a) showed comparable success rates of around 87% F-measure for pure mutation mention extraction.
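For reference, the F-measure quoted here is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the mutation-level counts given in this evaluation (741 of 907 annotated mentions found, 61 false positives). The function name `f_measure` is illustrative and not part of the pipeline.

```python
def f_measure(true_pos, false_pos, annotated_total):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = true_pos / (true_pos + false_pos)  # correct / all predicted
    recall = true_pos / annotated_total            # correct / all annotated
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Mutation-level counts from the evaluation: 741 of 907 found, 61 false positives.
precision, recall, f1 = f_measure(741, 61, 907)
print(f"precision={precision:.3f} recall={recall:.3f} F={f1:.3f}")
# precision=0.924 recall=0.817 F=0.867
```

The result of roughly 0.87 agrees with the 87% reported for pure mutation mention extraction.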
On the document level, from 182 abstracts con- taining mutations, 163 were identified, in 4 abstracts itary diseases, such as cystic fibrosis, or retinitis mutation were wrongly predicted. On the mutation pigmentosa. The reason are often conformational level 741 out of 907 were identified alongside 61 changes in proteins, which may lead to malfunction false positives. of a whole protein complex. Unfortunately, identi- To assess the refinement possibilities for falsely fied structures for membrane proteins are still rare. top ranked gene names, from the 182 abstracts we For this reason, we used a coarse grained model took the subset of those, the gene normaliser identi- presented by (Dressel et al., 2008) considering se- fied genes from one of the 10 supported species: hu- quence information only, to assess the influence of man, mouse, yeast, rat, fruit fly, H. pylori, S. Pombe, mutations on protein structure. C. Elegans, A. Thaliana, and D. Rerio. This led to The approach considers the solvation energy, a subset of 22 abstracts. In the initial run, the gene which is based on the probability distribution for name identifier identified in 17 of 22 abstracts (77%) each amino acid within the integral part of a mem- the correct gene as the top ranked candidate. How- brane protein to be facing the membrane or other ever, after the gene tagging refinement by applying proteins. The amino acid specific property inside the sequence filter to all candidate genes, the genes or outside reflects the orientation of the amino acid of 3 more papers were identified correctly replacing side chains with respect to the centre of mass of the the original and false top candidate. This led to the neighbouring residues. For a given mutation, the correct protein normalisation for 20 out of 22 (91%) approach compares the solvation energies for wild- publications. For the remaining 2 publication, the type and mutant residues. 
If the energies differ sig- correct genes could not be identified, as they were nificantly, a destabilising effect is predicted, espe- from species, the gene identifier does not yet sup- cially if the energies are changing from negative to port. The suggested genes from mouse were first positive or vice versa. falsely predicted, which were then not supported by To quantify the ability of this model to pre- the sequence checks. By this the proposed identi- dict the influence of mutations on the stability of fiers were brought below the threshold, resulting in membrane proteins, we compared already examined no gene identification at all for these 2 abstracts and and published effects of mutations with the predic- turning the 2 “false positives” to “false negatives”. tions of the sequence based model. For this pur- On-the-fly vs. database approach We evaluated pose, we screened the literature for single point mu- the results of the two access approaches (database tations reported for five membrane proteins from and on-the-fly) for human Aquaporin-1, as part of the family of G protein-coupled receptors (bacteri- the stability analysis of protein membranes (see Sec- orhodopsin and halorhodopsin from Halobacterium tion 5). The precision of the on-the-fly approach is salinarum, bovine rhodopsin, Na+/H+ antiporter expected to be lower, as the first step is more general from Escherichia coli, and human aquaporin-1). As due to relying on full text searches instead of entity described in Section 4, Protein-centred approach recognition. Indeed, in comparison to the unique 20 and Figure 1B, articles relevant for these proteins mutations found by the organism-centred approach, were identified by searching PubMed via the NCBI 9 additional mutations were found, of which all were Entrez Programming Utilities. Abstracts for each false positives, actually appearing in Aquaporin-2 or protein were queried by the protein and gene name 4. 
This supports the good precision of the named en- including the synonyms as derived from the corre- tity approach for the gene normalisation. sponding PDB/UniProt entry. The MutationTagger was applied on these five 5 Application sets of abstracts for the extraction of mutation infor- mation. The application of sequence checks brought Predicting effects of mutations based on sequence the results down to a reasonable number of proposed Integral membrane proteins play an important role mutations, which were presented as HTML docu- in all organisms, especially as transporters. Due to ments and subsequently manually curated. We only their striking importance, mutations in membrane used the publications where a single point mutation proteins are known to be the cause of many hered- was discussed in the context of stability or stabil- ity related function. Double or multiple mutations for subsequent studies. The sequence checks applied were not considered, as the determination of a direct on identified mutations and candidate proteins have relation between the reported effect and one of the been proven to be an efficient, yet not sufficient fil- mutations is not possible. If an appropriate mutation ter for determing mutation-protein associations. The was found in the literature, we compared the solva- filter shows good sensitivity but improvable speci- tion energies of both wild-type and mutant residues ficity, especially regarding the species level. Fur- to decide, if the mutation was stabilising, slightly thermore, we were able to show, that the mutation stabilising, slightly destabilising, or destabilising. information from literature can even further improve Example Mutation T93P for bovine rhodopsin was the quality of the gene tagging algorithm we used, reported to lead to a conformational change of the which already showed very good results. protein. Considering the two solvation energies of wild type Threonine (-0.66 a.u.) and mutant Proline (0.08 a.u.) 
a destabilising effect can be predicted, References although both amino acids are actually classified as B Alberts, D Bray, K Hopkin, A Johnson, J Lewis, neutral. Without the change of sign from - to +, an M Raff, K Roberts, and P Walter. 2002. Essential Cell Biology. Garland Science Textbooks, London. only slightly destabilising effect would have been hypothesised. Michael Ashburner, Catherine Ball, Judith Blake, David Botstein, Heather Butler, J. Cherry, Allan Davis, Kara Relevance We were able to show the ability of our Dolinski, Selina Dwight, Janan Eppig, Midori Har- mutation mining approach to retrieve publications ris, David Hill, Laurie Issel-Tarver, Andrew Kasarskis, containing mutation information for given proteins Suzanna Lewis, John Matese, Joel Richardson, Martin at a good precision. Due to the quick and precise Ringwald, Gerald Rubin, and Gavin Sherlock. 2000. retrieval of mutation data we were able to assess the Gene ontology: tool for the unification of biology. the gene ontology consortium. Nature genetics., 25:25– soundness of the coarse grained model for the pre- 29, May. 10.1038/75556. diction of stabilising regions in membrane proteins. 25 out of 35 mutational effects reported in the liter- C Broud, G Collod-Broud, C Boileau, T Soussi, and C Ju- nien. 2000. Umd (universal mutation database): a ature for any of these five membrane proteins corre- generic software to build and analyze locus-specific late with the predictions based on the solvation en- databases. Hum Mutat, 15(1):86–94. ergy. These cases suggest a relation between muta- MN Cantor and YA Lussier. 2004. Mining omim for tions and stability issues in membrane proteins. insight into complex diseases. Medinfo, 11(Pt 2):753– Acknowledgement: We are grateful for financial 7. support by the EU project Sealife and the BMBF J. Gregory Caporaso, Jr William A. Baumgartner, Format Project CLSD and to Frank Dressel and Dirk David A. Randolph, K. 
Bretonnel Cohen, and Labudde for discussions on the application. Lawrence Hunter. 2007a. Mutationfinder: A high- performance system for extracting point mutation 6 Conclusion mentions from text. Bioinformatics, 23:1862–1865, Jul. 10.1093/bioinformatics/btm235. We developed a rule- and regular expression-based J. Gregory Caporaso, William A. Baumgartner, David A. approach that allows for the retrieval of protein point Randolph, K. Bretonnel Cohen, and Lawrence Hunter. mutations from the whole PubMed database specif- 2007b. Rapid pattern development for concept recog- nition systems: application to point mutations. Jour- ically for any given protein. This flexibility makes nal of bioinformatics and computational biology, it a powerful tool for immediately finding relevant 5:1233–1259, Dec. data for follow-up studies, as we showed in the ap- Andreas Doms and Michael Schroeder. 2005. Gop- plication on five membrane proteins. In addition, ubmed: exploring pubmed with the gene on- MutationTagger can be utilised for the species-wide tology. Nucleic Acids Res, 33:W783–6, Jul. identification of mutations in proteins mentioned in 10.1093/nar/gki470. PubMed. We started to set up a mutation database F Dressel, A Marsico, A Tuukkanen, R Winnenburg, which allows for systematically querying mutation D Labudde, and M Schroeder. 2008. Stabilizing re- related information, and finding relevant literature gions in membrane proteins. In From Computational Biophysics to Systems Biology (CBSB08), pages 197– Lawrence C. Lee, Florence Horn, and Fred E. Co- 9. hen. 2007. Automatic extraction of protein point mutations using a graph bigram association. PLoS M Erdogmus and OU Sezerman. 2007. Application of computational biology, 3:e16, Feb. 10.1371/jour- automatic mutation-gene pair extraction to diseases. J nal.pcbi.0030016. Bioinform Comput Biol, 5(6):1261–75, Dec. AC Martin. 2005. Mapping pdb chains to uniprotkb en- D Fredman, G Munns, D Rios, F Sjholm, M Siegfried, tries. 
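The regular expression-based retrieval summarised in the conclusion, together with the sequence check used to link a mutation to a gene, might be sketched as follows. This is a minimal illustration and not the MutationTagger rule set: the two patterns, the function names `find_point_mutations` and `is_consistent`, and the example sentence are all assumptions made for this sketch, and only compact mentions such as "R135L" or "Pro23His" are covered.

```python
import re

# Hypothetical patterns for illustration only; the actual rules of the
# pipeline described in the paper are not reproduced here.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "".join(sorted(THREE_TO_ONE.values()))
AA3 = "|".join(THREE_TO_ONE)

# Compact mentions: wild type residue, 1-based position, mutant residue.
ONE_LETTER = re.compile(rf"\b([{AA1}])([1-9]\d*)([{AA1}])\b")
THREE_LETTER = re.compile(rf"\b({AA3})([1-9]\d*)({AA3})\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    hits = []
    for m in ONE_LETTER.finditer(text):
        hits.append((m.start(), (m.group(1), int(m.group(2)), m.group(3))))
    for m in THREE_LETTER.finditer(text):
        wild_type = THREE_TO_ONE[m.group(1)]
        mutant = THREE_TO_ONE[m.group(3)]
        hits.append((m.start(), (wild_type, int(m.group(2)), mutant)))
    hits.sort(key=lambda hit: hit[0])
    return [mutation for _, mutation in hits]


def is_consistent(mutation, sequence):
    """Check that the wild type residue occurs at the stated position."""
    wild_type, position, _ = mutation
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild_type


if __name__ == "__main__":
    sentence = "The R135L and Pro23His substitutions destabilise the receptor."
    for mutation in find_point_mutations(sentence):
        print(mutation)
```

A pattern this simple also matches strings that are not mutations, for example the cell line name T47D. Checking the wild type residue against candidate gene sequences, as in `is_consistent`, is one way such matches can be filtered and a mutation tied to a specific gene.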