Mutation tagging with gene identifiers applied to
                   membrane protein stability prediction


               Rainer Winnenburg, Conrad Plake, and Michael Schroeder
                            Biotec, TU Dresden, Germany
                           ms@biotec.tu-dresden.de


Abstract

The automated retrieval and integration of information about protein point
mutations in combination with structure, domain and interaction data from
literature and databases promises to be a valuable approach to study
structure-function relationships in biomedical data sets.

As a prerequisite, we developed a rule- and regular expression-based protein
point mutation retrieval pipeline for PubMed abstracts, which shows an
F-measure of 87% for the pure mutation retrieval task on a benchmark dataset.

In order to link mutations to their proteins, we utilised a named entity
recognition algorithm for the identification of gene names co-occurring in the
abstract, and established links based on sequence checks. We identified more
than 10 million genes/proteins in nearly 3.5 million abstracts and 260,000
mutations in 80,000 of these abstracts (2.3%). In 52% of cases the identified
gene's sequence and the mutation are consistent. We evaluated the use of
mutations in gene identification in detail on a small test set of 22
abstracts. Identifying the correct gene improved from 77% to 91% when
considering the mutations.

To demonstrate practical relevance, we set up a mutation screening for five
membrane proteins from the family of G protein-coupled receptors to evaluate a
solvation energy based model for the prediction of stabilising regions in
membrane proteins. We identified 35 mutations in text. 25 out of 35 mutation
phenotypes reported in literature were in compliance with the prediction of
the energy model, which supports a relation between mutations and stability
issues in membrane proteins.

1   Introduction

Proteins carry out most cellular functions: they act as building blocks for
structures, as enzymes and gene regulators, and are involved in cell mobility
and communication (Alberts et al., 2002). Proteins may interact briefly with
each other in an enzymatic reaction, or for a long time to form part of a
protein complex. The interactions between proteins are of central importance
for almost all processes in living cells, and are described by numerous
distinct pathways in databases such as KEGG (Ogata et al., 1999). Malfunctions
or alterations in such pathways can be the cause of many diseases, for
instance when the biosynthesis of involved proteins is repressed or proteins
are not interacting the way they should. The latter can be due to structural
changes in one of the interacting proteins, caused by point mutations, i.e.
substitutions of single wild-type amino acids. Indeed, it is already well
known that such mutations are the cause of many hereditary diseases. Thus the
large-scale analysis of point mutation data in combination with information
about protein interactions, protein structure and disease pathogenesis might
facilitate the study of still unresolved phenotypes and diseases.
   It is envisaged to provide an automated system for the interpretation of
structure-function relations in the context of genetic variability data.
Despite the availability of numerous biomedical data collections, valuable
information about mutation-phenotype associations is still hidden in
non-structured text in the biomedical literature. Thus text mining methods are
implemented to automatically retrieve these data from the 18 million
literature references in PubMed. The extracted knowledge will be stored in one
homogeneous data store and integrated with already available data from
suitable databases. On the basis of all these combined data, new hypotheses
can be formulated, like the prediction of phenotypic effects induced by
mutations. At the moment, we are populating a database with organism specific
protein-mutation associations which we envisage to apply to diverse biological
problems, such as the detection of mutation centred gene-disease associations
in human.

2   Background

Genomic variation data has already been collected for many years. Single
nucleotide polymorphisms (SNPs), which make up about 90% of all human genetic
variation and occur every 100 to 300 bases along the 3-billion-base human
genome, are available as large collections. Single amino acid polymorphisms
(SAPs) are often manually extracted from literature and curated into
databases, originating from wet lab experiments. Additionally, some structures
of such mutations may be revealed in crystallography experiments and might
eventually end up as distinct structures in the Protein Data Bank (PDB). Of
particular interest is the identification of mutations which have a strong
influence on the stability of proteins. Therefore, the biomedical literature
can be systematically searched for information about mutation-phenotype
associations by text mining, which may lead to new insights beyond information
in existing databases. For the text mined data it is additionally possible to
weight or prioritise information according to its publication date, the
involved authors and the journal. Considering these meta data can be relevant
if for instance an already published assumption has been proven wrong in a
more recent publication, or for determining whether a protein is a hot topic
or if the information has already been available for years. Furthermore, it is
possible to obtain a more detailed view on a protein's characteristics, e.g.
if a certain interaction only takes place under specific conditions, or if an
interaction is prevented by the conformational change of a protein domain
triggered by a point mutation.

2.1   Databases

Data on mutations have been collected for years, for numerous species and by
different organisations for diverse purposes. There are many efforts to cope
with the data, which is being made available in a growing number of databases.
The Human Genome Variation Society (Horaitis and Cotton, 2004) promotes the
collection, documentation and free distribution of genomic variation
information. New mutation databases are reported in the journal Human Mutation
on a regular basis. There are manually curated databases like OMIM (Hamosh et
al., 2002), UniProt Knowledgebase (Yip et al., 2008; Yip et al., 2007), and
general central repositories like the Human Gene Mutation Database (Stenson et
al., 2008), Universal Mutation Database (Béroud et al., 2000), Human Genome
Variation Database (Fredman et al., 2004), and MutDB (Singh et al., 2007).

Besides these central repositories, there are small specialised databases,
such as the infevers autoinflammatory mutation online registry (Milhavet et
al., 2008), the GPCR NaVa database for natural variants in human G
protein-coupled receptors (Kazius et al., 2007), or the Pompe disease mutation
database with 107 sequence variants (Kroos et al., 2008).

In contrast, unpublished SNPs normally make their way into large locus
specific data repositories. Since August 2006, SNPedia has offered a
wiki-based alternative to classical databases for collecting information on
variations in human DNA.

2.2   Text mining

Despite the availability of numerous biomedical data collections, valuable
information about mutation-phenotype associations is still hidden in
non-structured text in the biomedical literature. Thus text mining methods are
implemented to automatically retrieve these data from the 18 million
referenced articles in PubMed. Text mining aims to automatically extract and
combine information spread
across several natural language texts, thereby generating new hypotheses. One
of the key prerequisites for finding new facts (e.g. interactions or
mutations) is named entity recognition (NER) in text: the assignment of a
class to an entity (e.g. protein), as well as a preferred term or identifier,
in case an entry in a database, such as UniProt, or a controlled vocabulary
like the Gene Ontology (GO) (Ashburner et al., 2000) exists. For the task of
named entity recognition usually a dictionary is used, which contains a list
of all known entity names of a class (e.g. human proteins) including synonyms.
For the recognition of patterns (e.g. database identifiers like NM_12345)
regular expressions can be defined. For the analysis of whole sentences,
natural language processing (NLP) techniques are used, which aim to understand
text on a syntactic and semantic level. This approach is often paired with
systems which are based on a set of manually defined rules or which make use
of (semi-)supervised machine learning algorithms.

Up to now, there have already been diverse examples for the successful
application of text mining to the mutation retrieval task. Early examples are
the automatic extraction of mutations from Medline and cross-validation with
OMIM (Rebholz-Schuhmann et al., 2004), and the work by Cantor and Lussier
(2004), who mined OMIM for phenotypic and genetic information to gain insights
into complex diseases. More recently, Caporaso et al. (2007b) applied their
concept recognition system based on regular expressions to the mutation mining
task, and the automatic Extraction of Protein Point Mutations Using a Graph
Bigram association (Lee et al., 2007) was reported to reliably find
gene-mutation associations in full text. For identifying gene-specific
variations in biomedical text, Klinger et al. (2007) integrate the ProMiner
system developed for the recognition and normalisation of gene and protein
names with a conditional random field (CRF)-based recognition system. As an
answer to the diverse approaches developed over the past years, a framework
for the systematic analysis of mutation extraction systems was proposed by
Witte and Baker (2007).

More and more groups are working on mutations in proteins and their
involvement in diseases. Kanagasabai et al. (2007) developed mSTRAP (Mutation
extraction and STRucture Annotation Pipeline) for mining mutation annotations
from full-text biomedical literature, which they subsequently used for protein
structure annotation and visualisation. Worth et al. (2007) use structure
prediction to analyse the effects of nonsynonymous single nucleotide
polymorphisms (nsSNPs) with regard to diseases. Focussing on Alzheimer's
disease, Erdogmus and Sezerman (2007) extract mutation-gene pairs, with
estimated 91.3%, and precision at 88.9%. Lage et al. (2007) realised a human
phenome-interactome network of protein complexes implicated in genetic
disorders by integrating quality-controlled interactions of human proteins
with a validated, computationally derived phenotype similarity score.

3   Methods

Through the combination of different data from literature and databases it is
possible to derive new facts, e.g. novel gene-disease associations or the
influence of mutations on protein-protein interactions. The approach is
designed in such a way that it can in principle be applied to any kind of
genetic data for answering disease centred questions. For the moment, we
concentrate on collecting available high quality data on protein point
mutations from curated databases and from peer-reviewed literature. For the
latter we will present a flexible approach for both the specific and
high-throughput retrieval of mutations. In detail, the following tasks have to
be performed: (1) Identify genes/proteins in abstracts. (2) From this subset
consider only those which additionally contain information about mutations.
(3) Propose potential protein-mutation pairs. (4) Filter proposed pairs by
sequence compliance. (5) Utilise this information for the refinement of the
original gene/protein identifier.

3.1   Entity recognition

Gene normalisation. This module allows for the automated named entity
recognition of genes and proteins. Our approach performs gene name
disambiguation by using background knowledge to match a gene with its context
against the text as a whole (Hakenberg et al., 2007). A gene's context
contains information on Gene Ontology annotations, functions, tissues,
diseases etc. extracted from the databases Entrez Gene and UniProt. A
comparison
of gene contexts against the text gives a ranking of candidate identifiers and
the top ranked identifier is taken if it scores above a defined threshold.
This approach has been recently extended for inter-species normalisation and
achieves 81% success rate on a mixed dataset of 13 species (Hakenberg et al.,
2008).

Mutation tagging. We implemented an entity recognition algorithm
(MutationTagger) to automatically extract protein point mutation mentions from
PubMed abstracts. Wild-type and mutant amino acid, as well as the sequence
position of the substitution, are extracted by means of both a set of regular
expressions for pattern recognition of 1- or 3-letter notations (e.g. E312A or
Glu(312)→Ala), and rules for the more complex identification of textual
mutation descriptions (e.g. Glu312 was replaced with alanine). Problems
concerning the full text representations (detecting the correct sequence
position of the mutated residue and unravelling enumerations) have been
addressed by additional extraction algorithms and the implementation of a
sequence check. An evaluation of our method on the test data from
MutationFinder (Caporaso et al., 2007a) showed comparable success rates of
around 89% F-measure for mutation mention extraction.

3.2   Association of entity pairs

In the process of recognising mutations in text, the normalisation, i.e. the
direct association to specific proteins, remains a challenge. This is due to
the fact that the abstracts of relevant publications typically mention more
than one single mutation and protein. Thus, a mutation-protein association
purely based on their co-occurrence in one abstract is not sufficient, as
pairing every mutation with every protein would result in a huge number of
false positive predictions. The problem becomes even more evident when
considering that both gene and mutation tagging are imperfect, achieving a
precision of 80 to 90% each.

A method is desired that both disambiguates the relations of candidate
mutations and proteins, and filters out false positives from the underlying
individual mutation and protein recognition tasks. There are approaches which
apply a word distance metric for assigning a mutation to its nearest occurring
protein term, which is error prone, as matching mutation and protein do not
necessarily have to occur close to each other in the abstract or even in the
same sentence. The statistical approach GraB is an excellent tool for the
automatic extraction of Protein Point Mutations using a Graph Bigram
association (Lee et al., 2007), achieving good results for the most likely
mutation-protein association, but alone it would also not fulfil the second
aspect of filtering out false positives.

Sequence checks. Mutations are commonly described as the substitution of a
wild-type by a mutant amino acid at a given position. Our method compares the
wild-type residue as described in a mutation mention with the
UniProt/Swiss-Prot and PDB protein sequences for all candidate proteins. It is
important to incorporate sequences from both repositories, as the sequence
numbering can differ and it is not always evident from a publication's
abstract which numbering the mutation notation refers to. To map UniProt IDs
to PDB and vice versa, we used PDB cross-references in UniProtKB/Swiss-Prot
from http://beta.uniprot.org/docs/pdbtosp and the residue specific comparison
between PDB and SwissProt sequences as provided by
http://www.bioinf.org.uk/pdbsws/ (Martin, 2005). Only associations between
mutations and proteins with matching sequences are considered.

3.3   Annotation pipelines

The developed mutation retrieval pipeline can be accessed through two
different interfaces (see Figure 1), which offer, depending on the annotation
task, either a systematic or a quick and flexible solution. The following
approaches have been implemented:

  • Organism-centred approach (database)
    All available mutations for a given organism will be retrieved in one
    single literature screening and stored in the Mutation database. This
    approach relies on the large-scale identification of gene mentions in
    PubMed abstracts, which have to be compiled for organisms of interest
    prior to a mutation screening. As of now, gene mention data is available
    for human, mouse, and yeast. However, data for additional relevant
    organisms will be added on a regular basis in the near future.
close to each other in the abstract or even in the         • Protein-centred approach (on-the-fly)
Figure 1: Workflow of mutation data retrieval with MutationTagger. A: abstracts mentioning proteins for
given species are tagged for mutations. The filtered data is written to database. B: For a protein of interest
relevant articles are retrieved and tagged for mutations. The filtered data can be exported to HTML or SQL.


      It is possible to retrieve relevant data for a sin-   even if a set of different candidate identifiers was
      gle gene or a list of genes/ proteins for any         computed. According to internal ranking mech-
      organism. For this purpose, the gene identifi-        anisms, only the top scoring candidate is consid-
      cation part performed by the gene normaliser          ered. This leads to a possible scenario, where in
      is replaced by a direct full text search in the       some cases the correct identifier is ranked lower and
      PubMed library using the Entrez Programming           would be neglected for any subsequent data proces-
      Utilities. Again, the result is a set of abstracts,   sion. In case of our mutation mining algorithm, we
      which is subsequently processed by the Muta-          assume that some mutations cannot be associated to
      tionTagger.                                           the correct protein, because the gene tagging task al-
                                                            ready failed.
3.4   Improvement of gene normalisation
As described above, we defined the input set of doc-           On the other hand, it should be possible to im-
uments for the organism-centred mutation mining             prove the performance of both entity recognition
approach by scanning the whole PubMed database              techniques for genes and mutations by combining
for abstracts mentioning at least one gene or protein       the results. The idea is to run both approaches with
of a pre-defined species. For this filtering step, we       low precision thus receiving a high recall, permu-
relied on the gene normalisation techniques of our          tate all elements of both sets, and then consider
gene normaliser, which was applied to all PubMed            the intersection of all combinations that fit. Muta-
abstracts in advance and has shown 85% F-measure            tion and gene product are considered to be a valid
for human genes and slightly lower for other species.       pair, if the wild-type residues at the mutated posi-
However, the gene normalisation proposes by de-             tion in the protein sequence and in the reported mu-
fault only one single identifier per gene mention,          tation match (as described in section 3.1). For all
proposed gene identifiers, protein sequences are ob-
tained and checked for compliance with the reported
wild type amino acid. The scores of identifiers that
show a match are increased, which might lead to
a re-ranking of the identifiers for one gene entity.
This could further improve the original gene nor-
malisation approach for candidate entities which are
reported to show a mutation.
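
The interplay of mutation tagging, sequence check and re-ranking described in Sections 3.1 to 3.4 can be illustrated with a minimal Python sketch. It is not the MutationTagger implementation: the two regular expressions cover only the simple 1- and 3-letter notations, the `bonus` added to a supported candidate is a hypothetical parameter (the text only says the score is increased), and the sequences in the usage part are invented placeholders rather than real UniProt or PDB entries.

```python
import re
from dataclasses import dataclass

# Three-letter to one-letter codes of the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = "|".join(THREE_TO_ONE)

# Illustrative patterns only (assumption): e.g. "E312A" and "Glu(312)→Ala".
ONE_LETTER = re.compile(rf"\b([{AA1}])(\d+)([{AA1}])\b")
THREE_LETTER = re.compile(rf"\b({AA3})\(?(\d+)\)?\s*(?:→|->)?\s*({AA3})\b")


@dataclass(frozen=True)
class Mutation:
    wild_type: str  # one-letter code
    position: int   # 1-based sequence position
    mutant: str     # one-letter code


def tag_mutations(text):
    """Collect point mutations written in 1- or 3-letter notation."""
    found = [Mutation(m.group(1), int(m.group(2)), m.group(3))
             for m in ONE_LETTER.finditer(text)]
    found += [Mutation(THREE_TO_ONE[m.group(1)], int(m.group(2)),
                       THREE_TO_ONE[m.group(3)])
              for m in THREE_LETTER.finditer(text)]
    return found


def supports(sequence, mutation):
    """Sequence check: does the wild-type residue sit at the given position?"""
    index = mutation.position - 1
    return 0 <= index < len(sequence) and sequence[index] == mutation.wild_type


def rerank(candidates, sequences, mutations, bonus=1.0):
    """Raise the score of candidates supported by at least one mutation.

    candidates: dict mapping gene identifier -> normaliser score
    sequences:  dict mapping gene identifier -> list of protein sequences
                (several per gene, since numbering can differ by repository)
    Returns the gene identifiers sorted by the adjusted score, best first.
    """
    rescored = {}
    for gene_id, score in candidates.items():
        hit = any(supports(seq, mut)
                  for seq in sequences.get(gene_id, [])
                  for mut in mutations)
        rescored[gene_id] = score + bonus if hit else score
    return sorted(rescored, key=rescored.get, reverse=True)


# Usage with placeholder data: two candidate identifiers, toy sequences.
mutations = tag_mutations("W191G was reported; Glu(312)→Ala is a notation example.")
candidates = {"1421": 0.9, "853940": 0.6}
sequences = {
    "1421": ["A" * 200],                        # no tryptophan at position 191
    "853940": ["A" * 190 + "W" + "A" * 9],      # tryptophan at position 191
}
ranking = rerank(candidates, sequences, mutations)
print(ranking)
```

With these placeholder inputs the initially lower-scored identifier moves to the top because only its sequence carries the reported wild-type residue. Textual descriptions such as "Glu312 was replaced with alanine" would need additional rules that this sketch does not attempt.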
Example. As shown in Figure 2, our gene normaliser identified CCP (human
crystallin, gamma D; EntrezGene ID 1421) as the top candidate gene name for
abstract PMID 8142383. The mutation tagger identified a replacement of
tryptophan with glycine at position 191 as the only mutation mentioned in the
paper. None of the protein sequences retrieved for human CCP showed a
tryptophan residue at position 191, which means that this gene identifier was
not supported by mutation information. However, besides human crystallin,
there was also cytochrome-c peroxidase in yeast (EntrezGene ID 853940)
proposed as an alternative identifier, which received a lower score. As the
product of this gene showed a tryptophan residue at position 191 (according to
the PDB sequence numbering), the score was increased, making it the new top
candidate. Indeed, manual curation of the corresponding literature confirmed
that the only gene mentioned in the abstract is cytochrome-c peroxidase in
yeast. The same positive re-ranking, finding the correct gene identifier
through mutation information, was shown for human TP53 in paper 11254385, and
human amylase alpha in paper 15182367.

Figure 2: Example for gene name normalisation with the help of mutation
mining. Initially, our gene normaliser proposed the human gene CCP as its
context fits the text best (abstract not fully shown). However, when comparing
the recognised mutation at position 191 with the sequences of all three
candidates, only CCP in yeast contains the wild-type tryptophan at the
specified position (PDB entry). After checking the full text of this
publication, we found that CCP indeed refers to the gene in Saccharomyces
cerevisiae.

4   Results

Mutation database. In order to establish a mutation database, which will
eventually store all protein point mutations mentioned in PubMed abstracts for
all organisms of interest, a first platform has been realised, comprising a
MySQL database, which can be accessed by a web-interface.

To populate the database, in a first step the PubMed corpus is filtered for
abstracts mentioning at least one gene or protein using the named entity
recognition algorithm as described in Section 3.1, which is currently working
for the three organisms human, mouse, and yeast. This led to a set of
3,443,566 abstracts proposing more than 10 million potential protein
candidates. In a second step, the mutation extraction algorithm is applied to
this corpus and the retrieved information is transferred into the database. In
total, 258,511 mutations were found in 78,968 abstracts. Subsequently, for all
candidate genes found in these abstracts, the corresponding sequences are
obtained and checked for compliance with the wild type amino acid at the
position of the mentioned mutation, which led to a number of 877,183 potential
protein-mutation pairs. Out of these, 127,384 are supported by sequence
(74,722 if multiple mentions of the same mutation in one abstract are counted
as one) in contrast to 131,127 (77,643) mutations which have not passed the
sequence filter. In summary, from all mutations identified by the plain
algorithm, about 49% could be supported by gene associations based on sequence
check. These data were retrieved from 41,384 (52%) abstracts in total.

Evaluation. We evaluated our approach on two different tasks: pure
identification of a mutation in a text, and the identification of correct
mutation-protein pairs. An evaluation of our method on the test data from
MutationFinder (Caporaso et al., 2007a) showed comparable success rates of
around 87% F-measure for pure mutation mention extraction. On the document
level, from 182 abstracts con-
taining mutations, 163 were identified, in 4 abstracts    itary diseases, such as cystic fibrosis, or retinitis
mutations were wrongly predicted. On the mutation         pigmentosa. The reasons are often conformational
level 741 out of 907 were identified alongside 61         changes in proteins, which may lead to malfunction
false positives.                                          of a whole protein complex. Unfortunately, identi-
   To assess the refinement possibilities for falsely     fied structures for membrane proteins are still rare.
top ranked gene names, from the 182 abstracts we          For this reason, we used a coarse grained model
took the subset for which the gene normaliser identi-     presented by Dressel et al. (2008) considering se-
fied genes from one of the 10 supported species: hu-      quence information only, to assess the influence of
man, mouse, yeast, rat, fruit fly, H. pylori, S. pombe,   mutations on protein structure.
C. elegans, A. thaliana, and D. rerio. This led to           The approach considers the solvation energy,
a subset of 22 abstracts. In the initial run, the gene    which is based on the probability distribution for
name identifier identified in 17 of 22 abstracts (77%)    each amino acid within the integral part of a mem-
the correct gene as the top ranked candidate. How-        brane protein to be facing the membrane or other
ever, after the gene tagging refinement by applying       proteins. The amino acid specific property inside
the sequence filter to all candidate genes, the genes     or outside reflects the orientation of the amino acid
of 3 more papers were identified correctly replacing      side chains with respect to the centre of mass of the
the original and false top candidate. This led to the     neighbouring residues. For a given mutation, the
correct protein normalisation for 20 out of 22 (91%)      approach compares the solvation energies for wild-
publications. For the remaining 2 publications, the       type and mutant residues. If the energies differ sig-
correct genes could not be identified, as they were       nificantly, a destabilising effect is predicted, espe-
from species, the gene identifier does not yet sup-       cially if the energies are changing from negative to
port. The suggested genes from mouse were first           positive or vice versa.
falsely predicted, which were then not supported by          To quantify the ability of this model to pre-
the sequence checks. By this the proposed identi-         dict the influence of mutations on the stability of
fiers were brought below the threshold, resulting in      membrane proteins, we compared already examined
no gene identification at all for these 2 abstracts and   and published effects of mutations with the predic-
turning the 2 “false positives” to “false negatives”.     tions of the sequence based model. For this pur-
On-the-fly vs. database approach We evaluated             pose, we screened the literature for single point mu-
the results of the two access approaches (database        tations reported for five membrane proteins from
and on-the-fly) for human Aquaporin-1, as part of         the family of G protein-coupled receptors (bacteri-
the stability analysis of membrane proteins (see Sec-     tions reported for five membrane proteins from
tion 5). The precision of the on-the-fly approach is      salinarum, bovine rhodopsin, Na+/H+ antiporter
expected to be lower, as the first step is more general   from Escherichia coli, and human aquaporin-1). As
due to relying on full text searches instead of entity    described in Section 4, Protein-centred approach
recognition. Indeed, in comparison to the unique 20       and Figure 1B, articles relevant for these proteins
mutations found by the organism-centred approach,         were identified by searching PubMed via the NCBI
9 additional mutations were found, of which all were      Entrez Programming Utilities. Abstracts for each
false positives, actually appearing in Aquaporin-2 or     protein were queried by the protein and gene name
4. This supports the good precision of the named en-      including the synonyms as derived from the corre-
tity approach for the gene normalisation.                 sponding PDB/UniProt entry.
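
The evaluation figures reported in Section 4 can be recomputed from the stated counts. The sketch below assumes that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall), which the text does not state explicitly; the function name is illustrative.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return precision, recall and balanced F-measure (F1) from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Mutation level: 741 of 907 mutations identified, 61 false positives.
mut_p, mut_r, mut_f = precision_recall_f1(741, 61, 907 - 741)
print(f"mutation level: P={mut_p:.3f} R={mut_r:.3f} F={mut_f:.3f}")

# Document level: 163 of 182 abstracts identified, 4 wrongly predicted.
doc_p, doc_r, doc_f = precision_recall_f1(163, 4, 182 - 163)
print(f"document level: P={doc_p:.3f} R={doc_r:.3f} F={doc_f:.3f}")

# Gene identification on the 22-abstract subset, before and after refinement.
before, after = 17 / 22, 20 / 22
print(f"correct top candidate: {before:.0%} -> {after:.0%}")
```

Under this assumption the mutation-level counts give a precision of about 0.92, a recall of about 0.82 and an F-measure of about 0.87, in line with the reported 87%.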
                                                             The MutationTagger was applied on these five
5   Application                                           sets of abstracts for the extraction of mutation infor-
                                                          mation. The application of sequence checks brought
Predicting effects of mutations based on sequence         the results down to a reasonable number of proposed
Integral membrane proteins play an important role         mutations, which were presented as HTML docu-
in all organisms, especially as transporters. Due to      ments and subsequently manually curated. We only
their striking importance, mutations in membrane          used the publications where a single point mutation
proteins are known to be the cause of many hered-         was discussed in the context of stability or stabil-
ity related function. Double or multiple mutations        for subsequent studies. The sequence checks applied
were not considered, as the determination of a direct     on identified mutations and candidate proteins have
relation between the reported effect and one of the       been proven to be an efficient, yet not sufficient fil-
mutations is not possible. If an appropriate mutation     ter for determing mutation-protein associations. The
was found in the literature, we compared the solva-       filter shows good sensitivity but improvable speci-
tion energies of both wild-type and mutant residues       ficity, especially regarding the species level. Fur-
to decide, if the mutation was stabilising, slightly      thermore, we were able to show, that the mutation
stabilising, slightly destabilising, or destabilising.    information from literature can even further improve
Example Mutation T93P for bovine rhodopsin was            the quality of the gene tagging algorithm we used,
reported to lead to a conformational change of the        which already showed very good results.
protein. Considering the two solvation energies of
wild type Threonine (-0.66 a.u.) and mutant Proline
(0.08 a.u.) a destabilising effect can be predicted,
                                                          References
although both amino acids are actually classified as      B Alberts, D Bray, K Hopkin, A Johnson, J Lewis,
neutral. Without the change of sign from - to +, an         M Raff, K Roberts, and P Walter. 2002. Essential
                                                            Cell Biology. Garland Science Textbooks, London.
only slightly destabilising effect would have been
hypothesised.                                             Michael Ashburner, Catherine Ball, Judith Blake, David
                                                            Botstein, Heather Butler, J. Cherry, Allan Davis, Kara
Relevance We were able to show the ability of our
                                                            Dolinski, Selina Dwight, Janan Eppig, Midori Har-
mutation mining approach to retrieve publications           ris, David Hill, Laurie Issel-Tarver, Andrew Kasarskis,
containing mutation information for given proteins          Suzanna Lewis, John Matese, Joel Richardson, Martin
at a good precision. Due to the quick and precise           Ringwald, Gerald Rubin, and Gavin Sherlock. 2000.
retrieval of mutation data we were able to assess the       Gene ontology: tool for the unification of biology. the
                                                            gene ontology consortium. Nature genetics., 25:25–
soundness of the coarse grained model for the pre-          29, May. 10.1038/75556.
diction of stabilising regions in membrane proteins.
25 out of 35 mutational effects reported in the liter-    C Broud, G Collod-Broud, C Boileau, T Soussi, and C Ju-
                                                            nien. 2000. Umd (universal mutation database): a
ature for any of these five membrane proteins corre-        generic software to build and analyze locus-specific
late with the predictions based on the solvation en-        databases. Hum Mutat, 15(1):86–94.
ergy. These cases suggest a relation between muta-
                                                          MN Cantor and YA Lussier. 2004. Mining omim for
tions and stability issues in membrane proteins.           insight into complex diseases. Medinfo, 11(Pt 2):753–
Acknowledgement: We are grateful for financial             7.
support by the EU project Sealife and the BMBF
                                                          J. Gregory Caporaso, Jr William A. Baumgartner,
Format Project CLSD and to Frank Dressel and Dirk            David A. Randolph, K. Bretonnel Cohen, and
Labudde for discussions on the application.                  Lawrence Hunter. 2007a. Mutationfinder: A high-
                                                             performance system for extracting point mutation
6   Conclusion                                               mentions from text. Bioinformatics, 23:1862–1865,
                                                             Jul. 10.1093/bioinformatics/btm235.
We developed a rule- and regular expression-based         J. Gregory Caporaso, William A. Baumgartner, David A.
approach that allows for the retrieval of protein point      Randolph, K. Bretonnel Cohen, and Lawrence Hunter.
mutations from the whole PubMed database specif-             2007b. Rapid pattern development for concept recog-
                                                             nition systems: application to point mutations. Jour-
ically for any given protein. This flexibility makes         nal of bioinformatics and computational biology,
it a powerful tool for immediately finding relevant          5:1233–1259, Dec.
data for follow-up studies, as we showed in the ap-
                                                          Andreas Doms and Michael Schroeder. 2005. Gop-
plication on five membrane proteins. In addition,           ubmed: exploring pubmed with the gene on-
MutationTagger can be utilised for the species-wide         tology.    Nucleic Acids Res, 33:W783–6, Jul.
identification of mutations in proteins mentioned in        10.1093/nar/gki470.
PubMed. We started to set up a mutation database          F Dressel, A Marsico, A Tuukkanen, R Winnenburg,
which allows for systematically querying mutation           D Labudde, and M Schroeder. 2008. Stabilizing re-
related information, and finding relevant literature        gions in membrane proteins. In From Computational
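
The sequence check and the solvation-energy comparison described in Section 5 can be illustrated with a minimal sketch. It is not the MutationTagger implementation. The function names are hypothetical, and the decision rule is an assumption inferred from the T93P example: a higher solvation energy after the mutation counts as destabilising, a lower one as stabilising, and the label is only "slightly ..." unless the sign of the energy changes. Equal energies are returned as "no change", a case the text does not discuss.

```python
import re

# Wild-type residue, 1-based position, mutant residue, e.g. "T93P".
MUTATION = re.compile(r"^([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])$")


def parse_mutation(mention):
    """Split a mention such as 'T93P' into (wild_type, position, mutant)."""
    match = MUTATION.match(mention)
    if match is None:
        raise ValueError(f"not a point mutation mention: {mention!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def passes_sequence_check(mention, sequence):
    """True if the wild-type residue occurs at the stated 1-based position."""
    wild_type, position, _ = parse_mutation(mention)
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild_type


def classify_effect(wild_energy, mutant_energy):
    """Label a substitution from two solvation energies (arbitrary units).

    Assumed rule, not taken from the paper: the direction follows the sign of
    the energy difference, and the effect is 'slightly ...' unless the energy
    changes sign between wild type and mutant.
    """
    if mutant_energy == wild_energy:
        return "no change"
    direction = "destabilising" if mutant_energy > wild_energy else "stabilising"
    sign_change = (wild_energy < 0) != (mutant_energy < 0)
    return direction if sign_change else "slightly " + direction


# Energies quoted in the text for Threonine and Proline.
print(classify_effect(-0.66, 0.08))
# Toy sequence for illustration only, not a real protein sequence.
print(passes_sequence_check("T3A", "MKTAY"))
```

With the energies quoted for T93P, the sketch returns "destabilising", matching the prediction in the example. A mention whose wild-type residue does not match the candidate sequence at the stated position is rejected by the check.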
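
The protein-centred retrieval step in Section 5 queries PubMed through the NCBI Entrez Programming Utilities, using protein and gene names together with their synonyms. The sketch below only builds such a request URL for the ESearch endpoint and makes no network call. The helper names, the restriction to the Title/Abstract field, and the example name list are assumptions for illustration and are not taken from the paper.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI Entrez Programming Utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(names):
    """OR-combine protein and gene names as quoted phrases (assumed field tag)."""
    return " OR ".join(f'"{name}"[Title/Abstract]' for name in names)


def esearch_url(names, retmax=100):
    """Return an ESearch URL that lists PubMed IDs matching any of the names."""
    params = {"db": "pubmed", "term": build_pubmed_query(names), "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Illustrative name list; in the paper, synonyms come from the PDB/UniProt entry.
print(esearch_url(["aquaporin-1", "AQP1"]))
```

The abstracts for the returned identifiers would then be fetched and passed to the mutation extraction step. Matching names as plain phrases is what makes this retrieval more general, and less precise, than the entity-recognition route discussed in Section 4.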