=Paper= {{Paper |id=Vol-429/paper-4 |storemode=property |title=Annotation of protein residues based on a literature analysis: cross-validation against UniProtKB |pdfUrl=https://ceur-ws.org/Vol-429/paper4.pdf |volume=Vol-429 |dblpUrl=https://dblp.org/rec/conf/eccb/NagelJR08 }} ==Annotation of protein residues based on a literature analysis: cross-validation against UniProtKB== https://ceur-ws.org/Vol-429/paper4.pdf
Annotation of protein residues based on a literature analysis:
cross-validation against UniProtKB
Kevin Nagel∗1 , Antonio Jimeno1 , Tom Oldfield1 and Dietrich Rebholz-Schuhmann1

1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK



Email: Kevin Nagel∗ - auyeung@ebi.ac.uk;

∗ Corresponding author




Abstract
Background: A protein annotation database, such as the Universal Protein Resource (UniProtKB), is a valuable
resource for the validation and interpretation of predicted 3D structure patterns in proteins. Previously, results
have been reported on methods that extract point mutations from biomedical literature and that can support the
time-consuming work of manual database curation. However, these methods are limited to point mutation extraction
and do not extract features for the annotation of proteins at the residue level.
Results: This work introduces a system that identifies protein residue sites in abstract texts and annotates them
with features extracted from the context. The performance of all text mining modules was evaluated against
a manually annotated corpus. The identified annotation features can be attributed to at least one of six targeted
categories, e.g. enzymatic reaction. Extracted results were cross-validated against UniProtKB, and a manual
assessment was performed for 13 annotations of residues that have not been confirmed in UniProtKB.
Conclusions: This work proposes a solution for the automatic extraction of protein residue annotation from
biomedical articles. The presented approach extends existing systems in that a wider range of residue
entities is considered and that features of residues are extracted as annotations.




Background
The understanding of the biological function of proteins remains a central challenge in biology.
In protein science, sequence analysis of amino acids and studies of their spatial distribution have
led to predictions and discoveries of a number of biologically significant patterns and motifs, e.g.
metal-binding sites, catalytic triads, and ligand binding sites [1–7]. Complementary to these mined
data is the proliferation of protein annotations obtained by extracting information from biomedical
articles with a view to updating existing databases. Clearly, annotations can be used to verify
data-mined sequence/structure patterns, and likewise predicted patterns can be used to search for
associations in the database. However, the major annotation effort at the current stage is the
compilation of features at the protein level, while the actual target should be the residue level,
because biological functions can be mapped to a defined group of residues in proteins (function
sites). This is also reflected in the field of automatic information extraction from literature,
where solutions
have been published for the extraction of interactions of proteins [8, 9], subcellular protein
localisation [10], pathway discovery [11], and function annotation with Gene Ontology
terminologies [12]. Few groups have investigated point mutation extraction, and those that did
performed no feature extraction for residue annotation [13–17].
    Published work has focused on the extraction of point mutations, which are one type of residue
entity [13–17]. The point mutation extraction systems MEMA [16] and MuteXt [17] use a dictionary
lookup approach to detect protein names and disambiguate multiple protein-residue pairs with a
word distance measurement. MutationGraB [13], the successor of MuteXt, uses a graph bigram method
to calculate the proximity by weighting the association of word pairs. Another application,
MutationMiner [15], focuses on the integration of extracted point mutations into a protein
structure visualisation program.
    These systems are all dedicated to the extraction of point mutations, but provide no extraction
of residue annotation. In a recent publication [14], an ontological model was proposed that should
hold information extracted by MutationMiner as well as point mutation annotations. However, the
authors neither provided results of feature extraction nor proposed a strategy for it. Residue
annotation differs from functional annotation of proteins because the biological role of a residue
is described in a biochemical context, which is then revealed in the function or property of the
protein. At present, neither such an ontological model nor a terminological resource is publicly
available.
    The goal of this research is the identification of the biological function of mined structure
patterns of proteins. For this purpose a novel approach that combines structure mining and text
mining is proposed. The results of the combined mining study will be published elsewhere. This
paper reports on the text mining part and introduces a strategy for the compilation of protein
residue annotations that can be used for the interpretation of structure patterns. The results
demonstrate that textual information can be captured and used to augment data in UniProtKB.
Because the primary data resource is Medline, the extraction covers a broad range of biomedical
fields, but is limited to abstract texts. The biological community benefits from the extracted
annotations, for example in that data-mined structure patterns can be interpreted biologically,
or predicted functions in proteins can be better characterised.
    The contribution of this work is the automatic extraction of protein residue annotation from
biomedical articles. Contextual information is exploited to identify features of residues that
correspond to one of six chosen target categories (SCAT, Table 1). As a result, proteins can be
selected with residues clustered by annotation type, which can lead to the discovery of, for
example, evolutionary relationships.

Results and Discussion
The following sections assess first the extraction system and then the extracted data.

Evaluation of the identification systems for mentions of organism, protein and residues and their
associations.
In order to evaluate the performance of the named entity recognition (NER) and association
detection (AD) systems used in this study, their results were compared against the results from
manual curation of a set of 100 Medline articles, i.e. the gold standard corpus (GC) generated as
part of this study.
    Table 2 (top) shows the performance of each named entity recognition module. With an F1 measure
of 0.91 the performance of the residue tagger is within the range of previous works where only
point mutations were identified as residues [13–17]. The performance of organism name recognition
was lower, with a precision of 0.81 and a recall of 0.72. Protein recognition has the lowest
performance (precision = 0.65, recall = 0.60). The relatively low recall is due to permutations
and lexical variants in the text that are not covered by the dictionaries.
    The evaluation of the organism-protein-residue AD module shows that the algorithm of [17] is
suitable for association detection. The module reaches a precision of 0.83 and a recall of 0.33
(Table 2, bottom). Two prominent reasons for the low recall are that the organism-protein
association was correct but the residue did not match the protein sequence, or that the
association of organism and protein was wrong in the first instance.
    The implemented association detection system is able to extract associations in accordance
with UniProtKB.
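The F1 measure used above is the harmonic mean of precision and recall. The following minimal sketch combines the precision and recall values quoted in this section; the function name `f1_score` and the dictionary layout are illustrative choices, and the resulting F1 values are derived here from the quoted figures rather than taken from Table 2.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the text of this section.
reported = {
    "organism NER": (0.81, 0.72),
    "protein NER": (0.65, 0.60),
    "organism-protein-residue AD": (0.83, 0.33),
}

# F1 values derived from the quoted figures, rounded to two decimals.
derived_f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}

if __name__ == "__main__":
    for name, f1 in derived_f1.items():
        print(f"{name}: F1 = {f1:.2f}")
```

Under these figures the AD module comes out at an F1 of about 0.47, which reflects the low recall discussed above.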


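The residue tagger evaluated above is described in Methods as a set of regular expressions, and the AD module checks each residue against the retrieved protein sequence. The sketch below illustrates both ideas under loud assumptions: the patterns are written for this example and are not the authors' expressions, they cover only single-site and point-mutation notations (not residue lists, interaction pairs, or grammatical forms), and all names (`AA_CODES`, `find_sites`, `find_point_mutations`, `residue_matches_sequence`) are hypothetical.

```python
import re

# One-letter codes for the 20 standard amino acids, keyed by lower-case
# three-letter abbreviation and by full name.
AA_CODES = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartate": "D",
    "cysteine": "C", "glutamine": "Q", "glutamate": "E", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}
# Longest names first so that "arginine" is tried before "arg".
NAME = "|".join(sorted(AA_CODES, key=len, reverse=True))
ONE = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative patterns (assumptions, not the published expressions).
# Note that SITE also matches the site part of a named mutation such as "Cys560Arg".
SITE = re.compile(rf"\b({NAME})[\s-]?\(?(\d+)\)?", re.IGNORECASE)
MUTATION_NAMED = re.compile(
    rf"\b({NAME})[\s-]?(\d+)(?:->|[\s-])?({NAME})\b", re.IGNORECASE
)
MUTATION_SHORT = re.compile(rf"\b([{ONE}])(\d+)([{ONE}])\b")


def to_one_letter(name):
    """Map a three-letter or full amino acid name to its one-letter code."""
    return AA_CODES[name.lower()]


def find_sites(text):
    """Return (amino acid, position) for single-site mentions such as 'Gly-12'."""
    return [(to_one_letter(m.group(1)), int(m.group(2))) for m in SITE.finditer(text)]


def find_point_mutations(text):
    """Return (wild type, position, substitution) tuples in order of appearance."""
    hits = []
    for m in MUTATION_NAMED.finditer(text):
        hits.append((m.start(), (to_one_letter(m.group(1)), int(m.group(2)),
                                 to_one_letter(m.group(3)))))
    for m in MUTATION_SHORT.finditer(text):
        hits.append((m.start(), (m.group(1), int(m.group(2)), m.group(3))))
    return [hit for _, hit in sorted(hits)]


def residue_matches_sequence(sequence, amino_acid, position):
    """True if the 1-based position exists and carries the given amino acid."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == amino_acid


if __name__ == "__main__":
    toy_sequence = "MTEYKLVVVGAGGVGKS"  # toy sequence used only for illustration
    for aa, pos in find_sites("Gly-12, arginine 4, Tyr74 and Arg(53)"):
        print(aa, pos, residue_matches_sequence(toy_sequence, aa, pos))
```

A stringent check of this kind rejects any mention whose numbering differs from the database sequence, which is one way to read the sequence mismatch named above as a cause of low recall.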
Cross-validation of organism-protein association with UniProtKB.
In this section the evaluation was performed automatically on a cross-validation test set (XC)
derived from the UniProt corpus (UC). Of the 136,566 citations listed in UniProt, a virtually
complete set of 136,559 abstract texts was retrieved from Medline to build the UC. Subselection
from the UC to determine the XC resulted in 5,253 abstract texts representing a range of diverse
proteins (Table 3, top). Corresponding to this test corpus are the set of 70,401 triplet
identifiers UniProtID-TaxonomyID-PMID (UTP) for the evaluation of the protein-organism association
and the set of 68,008 triplet identifiers UniProtID-ResidueID-PMID (URP) for the protein-residue
association (Table 3, middle and bottom).
    With a precision of 0.77 and a recall of 0.08 (F1 = 0.14), the result for organism-protein
association extraction indicates that, although the system seems to extract correct relations with
a reasonable number of true positives (TP), the recall is too low to fully judge the performance.
The low recall is best explained by missing information in the scientific documents that would
confirm the organism-protein association. For the protein-residue association, the stringent
residue-sequence match resulted in a precision of 1.00 and a recall of 0.14 (F1 = 0.25). The low
recall can be explained by several factors: 1) differences in protein sequence indexing between
the author and the database; 2) changes in the sequence indexing rules of UniProtKB; 3) sequence
variants which have not yet been reported in the database; 4) a false protein-organism
association, with the consequence that the incorrect sequence was retrieved.
    Note that the extraction system was evaluated on Medline abstracts for a range of diverse
proteins indexed by UniProtKB, as opposed to previous works that extracted from full texts for a
few example protein families. The results therefore imply that extraction from abstract texts
alone is possible for a number of different UniProt proteins.

PDB citation enrichment.
For each PDB protein entry a link to a corresponding UniProt record is available. The AD system
extracts only relations for proteins recorded in UniProtKB. Therefore each Medline record with a
detected organism-protein-residue (o-p-r) association can be added to the citation set of the
corresponding PDB entry. At the time of this analysis, the PDB contained 42,943 protein
structures, of which 42,653 had a corresponding UniProt protein identifier (11,912 unique
identifiers). For each of these proteins the whole of Medline was scanned for abstracts with
extracted organism-protein-residue associations. Figure 1 shows the comparison of the citation
sets based on UniProtKB references and on the whole Medline analysis.
    For 2,535 out of 11,912 proteins the extraction system found a total of 18,748 corresponding
PMIDs. Analysis with citation indices for this subset of proteins revealed that 680 out of 18,748
PMIDs were rediscoveries, i.e. citations already listed in UniProtKB. The low number of
rediscoveries can be explained by the fact that many annotations are made from sections only
available in the full text. Although the analysis was based on Medline abstract texts, the
extraction already found a large number of citations for 21 percent of the target proteins. With
a precision of 0.83 (determined by gold standard evaluation), the estimated number of TP among the
newly discovered citations is 15,560. In the context of the 16,560 references of the 2,535
proteins from UniProtKB, the extraction expands the citation set 1.94-fold.
    The extraction system can be used to expand the citation list of UniProtKB/PDB by using only
Medline abstract texts. In this experiment the estimated number of overlooked citations for a
subset of target proteins already provides a large set for feature extraction for the annotation
of protein residues.

Evaluation of feature extraction.
The detection of domain-specific features was done by a classification approach which required a
labelled reference set and a defined set of categories. The precision, recall and F1 measure were
calculated for each category and are summarised in Table 4. Two sets of categories were tested,
each with different but corresponding semantic categories: (1) the six targeted categories (SCAT)
and (2) the categories listed in the feature table in UniProtKB (FCAT).
    For SCAT, the classifiers for structure component, chemical modification, and binding type
yielded F1 measures of 0.69, 0.61, and 0.67. For FCAT the top performing classifiers were motif,
variant, and binding, with similar F1 scores (0.62, 0.61, 0.58). The remaining classifiers are
still usable for feature detection, as their precision scores were comparable to those of the
classifiers with the top F1 scores: enzymatic activity and cellular phenotype from SCAT, and
modified residue, active site and site from FCAT. The figures indicate that the features used here
are suitable for feature detection and classification. The performance of feature detection was
tested on the gold standard corpus (GC). Sentences mentioning residues were examined; where
applicable, suitable features were annotated manually and compared with the output of the
extraction method. The number of validated and non-validated features was determined and the
performance measured.
    The performance shows that the classification approach for feature detection had a reasonable
coverage for SCAT and FCAT (recall of 0.61 and 0.59, respectively) but is imprecise in capturing
the correct annotation (precision of 0.21 for both, Table 5). This is not surprising, considering
that features are expressed throughout the whole sentence, but have different attachments to
named entities.
    The association of residues and features was based on a syntactic analysis of their verbal and
prepositional relations using a shallow language parser. The approach was evaluated by its
performance in detecting all manually annotated residue-feature pairs within the GC data set. With
a precision of 0.54 and a recall of 0.81, the performance of the shallow parser suggests it is
highly usable for residue annotation extraction (Table 6). The low precision is explained by the
current implementation of the parser, which returns relations with nested prepositional phrases,
so that the calculated precision tends to be lower. The extraction performance decreases when
additional extraction modules (NER, AD, and feature extraction, FE) are used. This shows that the
extraction of annotations is highly sensitive to each extraction module.
    Although the performance of each module can be improved, the results show that the extraction
system can deliver residue annotations.

Protein residue annotation extraction and comparison with UniProtKB.
The extraction system in this study delivered classified features of protein residues from Medline
as annotations. This section provides examples of the validity of the derived annotations by
comparing information extracted from the gold standard corpus with entries in UniProtKB.
    Within this experiment, four UniProt proteins with a total of 19 annotations from seven
sentences and five abstract texts were mined with the extraction system (Table 7). A comparison of
the mined annotations with the corresponding entries in UniProt showed that six out of 19
annotations were equivalent to existing information in the database (rediscovery). Further, the
semantic tags of the annotations, provided by the classification of extracted text features, are
biologically meaningful. For example, “the putative catalytic triad” is correctly tagged as
enzymatic, because it is a chemical reaction site and therefore a requirement for enzymatic
function. In this example, the predicted semantic tag is equivalent to the category active site
from the feature table in UniProt. In another example, “major phosphorylation sites” was evaluated
as a rediscovery of the database information “Phosphothreonine; by MAPK” and “Phosphoserine; by
MAPK”, while the predicted tag (structural component) and the assigned category in UniProt
(modified residues) are not equivalent. This is still valid, because both pieces of information
describe the function of the residues as a modification site: the predicted tag represents this
as a substructure, whereas UniProt emphasises the modification of the residues.
    For the remaining 13 extracted annotations there is no equivalent information represented in
UniProt. All are tagged with structural component, which is biologically valid. For example,
“highly conserved C-terminal region” is an important substructure of the protein, and the
extraction can aid in determining evolutionarily important residues of protein families. However,
it is debatable whether the annotation “conserved phosphopantothenate binding” should be tagged as
structural component or binding.
    In conclusion, the biological significance of the extracted annotations was studied by
comparison with annotations from UniProt for the proteins extracted from the gold standard corpus.
The rediscovery data show that the SCAT scheme and its feature sets are able to capture
information corresponding to UniProt annotations. The predicted semantic tags are biologically
valid and do not necessarily have to be equivalent to the categories found in the database. The
novel discovery data, in turn, indicate a potential contribution of the extraction to the
automatic annotation of protein residues in UniProt.
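The semantic tags discussed in this section come from the endogenous classification described in Methods, where a multi-word term is scored per category by summing the associations of its constituent words. The sketch below illustrates that idea under stated assumptions: association is taken to be pointwise mutual information estimated from word counts, terms are split on whitespace, unseen word-category pairs contribute zero, and the toy reference set is invented for this example. It is not the implementation of [24].

```python
import math
from collections import Counter


def train(labelled_terms):
    """Estimate PMI(word, category) from (term, category) pairs."""
    pair_counts, word_counts, cat_counts = Counter(), Counter(), Counter()
    for term, category in labelled_terms:
        for word in term.lower().split():
            pair_counts[(word, category)] += 1
            word_counts[word] += 1
            cat_counts[category] += 1
    total = sum(word_counts.values())
    # PMI = log( P(word, category) / (P(word) * P(category)) )
    pmi = {
        (word, category): math.log(n * total / (word_counts[word] * cat_counts[category]))
        for (word, category), n in pair_counts.items()
    }
    return pmi, sorted(cat_counts)


def classify(term, pmi, categories):
    """Score each category as the sum of its constituents' PMI; unseen pairs add 0."""
    words = term.lower().split()
    scores = {c: sum(pmi.get((w, c), 0.0) for w in words) for c in categories}
    best = max(categories, key=lambda c: scores[c])
    return best, scores


# Invented toy reference set; category names follow those mentioned in the text.
TOY_REFERENCE_SET = [
    ("catalytic triad", "enzymatic activity"),
    ("catalytic residue", "enzymatic activity"),
    ("conserved region", "structural component"),
    ("terminal region", "structural component"),
    ("substrate binding", "binding"),
    ("metal binding", "binding"),
]
PMI, CATEGORIES = train(TOY_REFERENCE_SET)

if __name__ == "__main__":
    for term in ("putative catalytic triad", "conserved phosphopantothenate binding"):
        print(term, "->", classify(term, PMI, CATEGORIES))
```

With this toy data, “conserved phosphopantothenate binding” receives equal scores for binding and structural component. The tie is an artefact of the invented reference set, but it shows how a term whose constituents point to different categories can be ambiguous, as noted above.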


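The figures reported under PDB citation enrichment can be checked with a few lines of arithmetic. The variable names below are illustrative, and the numbers are those quoted in that section. The reported estimate of 15,560 true positives appears to correspond to applying the precision of 0.83 to all 18,748 extracted PMIDs; the sketch also shows the value obtained when the 680 rediscoveries are excluded first.

```python
# Figures quoted in the PDB citation enrichment section.
proteins_total = 11_912
proteins_with_hits = 2_535
extracted_pmids = 18_748
rediscovered_pmids = 680
ad_precision = 0.83
uniprot_references = 16_560
reported_true_positives = 15_560

# Share of target proteins for which citations were found (about 21 percent).
coverage = proteins_with_hits / proteins_total

# Precision applied to all extracted PMIDs, and to the non-rediscovered ones only.
tp_all = ad_precision * extracted_pmids
tp_novel_only = ad_precision * (extracted_pmids - rediscovered_pmids)

# Expansion of the citation set relative to the existing UniProtKB references.
fold_expansion = (uniprot_references + reported_true_positives) / uniprot_references

if __name__ == "__main__":
    print(f"coverage: {coverage:.1%}")
    print(f"TP estimate over all PMIDs: {tp_all:.0f}")
    print(f"TP estimate over novel PMIDs only: {tp_novel_only:.0f}")
    print(f"fold expansion: {fold_expansion:.2f}")
```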
Conclusions
The aim of this work was to compile protein residue features from Medline texts as annotations for
UniProtKB proteins by combining a series of text mining methods. Although the performance of each
module may not be at an optimal level, the generated data output indicates that the strategy is
able to deliver biologically meaningful results. Cross-validation against UniProtKB indicates that
the extraction contains novel information that can complement and update the knowledge in
UniProtKB and consequently provide annotations for PDB protein structures.
    It is important to note that the extraction was done only on abstract texts from Medline. The
advantage over full text is that a broad, publicly available range of scientific publications can
be exploited, but at the cost of the lower information content of abstract texts. However, the
results demonstrate that even with abstract texts a vast amount of annotation can be obtained.
    As high-performing NER, AD, and FE systems become more widely available, this conceptual
strategy for protein residue annotation extraction may yield optimal results for the biological
community.

Methods
The extraction of protein residue annotation from text can be divided into three steps: 1) named
entity recognition (NER) and extraction of residue mentions, 2) association detection (AD) of
related named entities, 3) extraction of annotation features for associated entities.

NER for protein and species.
Named entity recognition for proteins was based on an approach that combined dictionary lookup
with fuzzy matching and basic disambiguation [18–20]. All protein names were collected from
UniProtKB/SwissProt. Names of species were obtained by extracting the NCBI Taxonomy references
from UniProtKB/SwissProt and then collecting the scientific and common names of the referenced
organisms. The dictionary was complemented with terms describing only the referenced genus, and
the collection of full organism names (genus + species) was augmented with abbreviated genus forms
(first letter of genus + species). Web services for the identification of protein names and taxa
names are available from the TM infrastructure at the EBI [18].

Identification of residue mentions from the text.
The extraction of residue mentions follows the approaches of previous publications [16, 17]. Sets
of regular expressions were constructed to identify three types of protein residue site mentions.
The first, basic type is the single protein sequence site reference, which consists of a
(wild-type) amino acid name followed by the sequence position number (e.g. “Gly-12”, “arginine 4”,
“Tyr74”, “Arg(53)”). The second type is a point mutation, where the description details the change
of an amino acid at a given position. The common notation is the wild-type amino acid name and the
sequence position, followed by the substitution (e.g. “W77R”, “Cys560Arg”, “ser-52->ala”,
“ala2-methionine”). Finally, the third type of residue site describes either a list of residues or
an interaction pair (e.g. “Tyr 85 to Ser 85”, “Trp27–Cys29”). The common notation is an amino acid
name, sequence position, a connection symbol or connection word, amino acid name, and sequence
position. In addition to the abbreviated notation, residue sites can be expressed in grammatical
form (e.g. “isoleucine at position 3”, “substitution of Ala at position 4 to Gly”, “Ser472 to
glutamic acid”).

Identification of associations between mentions of species, proteins and residues.
The identification of a residue can only be validated if it is part of the protein sequence as
reported in a reference database (e.g., UniProtKB). This requires that the protein mention in the
text is further supported by evidence for the species under scrutiny, so that the appropriate
protein sequence can be selected from the bioinformatics database; this excludes the risk of using
orthologous protein sequences. The association of organisms with proteins and of proteins with
residues was based on the algorithm described by [17]. First, species and protein mentions were
associated by measuring the word distance between them. An associated protein and species mention
form a pair that specifies the protein by a unique identifier in the reference database
(UniProtKB). If no match was found, the association was relaxed to genus matching, resulting in a
list of protein identifiers. If multiple organisms matched, the word proximity metric was used to
prefer the closest word pair. The identifier was used to retrieve the protein sequence from the
database in order to validate the residue mention. According to the algorithm proposed by [17],
three cases can be distinguished: (1) the residue correctly matches the protein sequence,
(2) several alternative sequences match from a list of protein mentions (identifiers), and (3) no
match can be found for the residue in the available protein sequences. If several protein
sequences were relevant candidates, the word distance metric was again used to select the closest
word pairs.

Feature extraction for the annotation of residues.
The origin of a biological function of a protein is a group of residues, and their experimental
characterisation is reported in scientific publications. In this study the feature extraction
process was divided into two parts: in the first part the text was processed to extract noun
phrases (NPs) that served as candidate features, and in the second part the extracted candidate
features were classified into categories of annotation features. Noun phrases are specified as
nominal forms in combination with adjective and adverb mentions (NP = Det? (Adj|Adv|N)* N). Even
though most NPs denote terms, this is not always true [21].
    In the first part, the abstract text was split into sentences and annotated with
part-of-speech (pos) tags using the cistagger, which performs similarly to the treetagger but
integrates a large biomedical terminological resource. Then the shallow parser described in [22]
was applied to extract verbal and prepositional dependencies. Since this parser does not deal with
prepositional attachment ambiguity, it was extended with the prepositional phrase attachment
disambiguation module explained in [23]. In the second part, the features were categorized using
the endogenous classification approach described in [24]. Basically, the algorithm relies only on
the mutual information of the lexical constituents of terms and their assigned categories. In
contrast, the exogenous (corpus-based) approach requires large amounts of contextual cues, which
are difficult to obtain. The endogenous approach is therefore more reliable in producing results
even under conditions of sparse data. During the training phase, lexical constituents of
multi-word terms were extracted from a labelled reference set and represent features for a defined
set of categories. The association between the features and the categories was estimated based on
their mutual information score, and the association between a multi-word term and a category was
computed as the sum of the associations of its constituents. The categorization of a multi-word
term into one of the categories then amounts to the identification of the best fitting category
for a term based on the term’s components. The reference set for the relevant multi-word terms was
generated using maximal length noun phrase (MLNP) analysis based on two different sets of NPs that
were extracted from an analysis of the whole of the Medline abstract texts: the first set consists
of NPs that co-occurred with residue mentions in the same sentence without nested residue terms
(NP(not r)), and the second set represents NPs with nested residue terms (NP(r)), since
co-occurrence with a residue may indicate higher relevance. Once the set of MLNPs was extracted,
each NP was manually labelled using three different categorization schemes. The first scheme is a
binary labelling (BCAT) that separates domain-relevant terms from non-relevant ones. The second
scheme uses six semantic categories identified from a study on the manual categorization of
residue annotations based on scientific content from Medline (bottom-up approach). The identified
categories and their definitions are shown in Table 1 (SCAT). The final set was defined through a
top-down approach by reusing categories described in the feature table of the UniProtKB data
resource for proteins (FCAT).

Generation of evaluation corpora.
For the evaluation of the extraction system, two test corpora were generated using the UniProt
corpus (UC). The UC consists of those Medline abstract texts that are cited in the UniProt
database for relevant protein-residue pairs. The complete corpus was automatically analysed for
organism, protein and residue mentions and tagged appropriately. A gold standard corpus (GC) was
created through manual curation, since no such corpora are available. A random sample of 100
Medline abstract texts was drawn from the UC, where every abstract had to fulfil the condition
that a mention of an organism, a protein and a residue was present (tri-co-occurrence). All
mentions of an organism, a protein, and a residue, the associations between the mentions, and the
contained features of the residues (see above) were then annotated manually by two independent
annotators with domain expertise. For the automatic evaluation
of extracted data a cross-validation corpus (XC) was             8. Marcotte E, Xenarios I, Eisenberg D: Mining litera-
derived from UC, because not all database informa-                  ture for protein-protein interactions. Bioinformat-
                                                                    ics 2001.
tion is necessarily expressed in abstract texts and
vice versa. Documents in UC were scanned for tri-                9. Blaschke C, Andrade M, Ouzounis C, Valencia A: Au-
                                                                    tomatic extraction of biological information from
occurrences of organism-protein-residue mentions in                 scientific text: Protein-protein interactions. Proc
text, and then analysed if the combinations of the                  Int Conf Intell syst Mol Biol 1999.
four identifiers UniProtID-TaxonomyID-ResidueID-                10. Stapley B, Kelley L, Sternberg M: Predicting the sub-
PMID can be found in the database. If at least a                    cellular location of proteins from text using sup-
single match was found the document was selected.                   port vector machines. Pac Symp Biocomput 2002.
For the non-matching combinations the correspond-               11. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky
                                                                    A: GENIES: A natural-language processing sys-
ing annotations were removed from text.
                                                                    tem for the extraction of molecular pathways from
                                                                    journal articles. Bioinformatics 2001.
                                                                12. Blaschke C, Leon EA, Krallinger M, Valencia A: Evalu-
Authors contributions                                               ation of BioCreAtIvE assessment of task 2. BMC
Kevin Nagel carried out the experiments, developed                  Bioinformatics 2005.
and implemented the methods, assessed the anno-                 13. Lee L, Horn F, Cohen F: Automatic extraction of
                                                                    protein point mutations using a graph bigram as-
tations, and drafted the manuscript. Antonio Ji-
                                                                    sociation. PLoS Computational Biology 2007.
meno participated in the development of the meth-
                                                                14. Witte R, Kappler T: Enhanced semantic access to
ods and drafted the manuscript. Dietrich Rebholz-                   the protein engineering literature using ontologies
Schuhmann participated in design of the exper-                      populated by text mining. Int. J. Bioinformatics Re-
iments, assessed the annotation and drafted the                     search and Applications 2007.
manuscript. All authors read and approved the final             15. Baker C, Witte R: Mutation Miner - Textual Anno-
manuscript.                                                         tation of Protein Structures. CERMM Symposium
                                                                    2005.
                                                                16. Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R,
                                                                    Casari G, Kirsch H: Automatic extraction of mu-
Acknowledgements                                                    tations from Medline and cross-validation with
We thank Kim Henrick, Michael Ashburner and Rob                     OMIM. Nucl. Acids Res. 2004.
Russell for their input in this project.                        17. Horn F, Lau A, Cohen F: Automated extraction of
                                                                    mutation data from the literature: application of
                                                                    MuteXt to G protein-coupled receptors and nu-
                                                                    clear hormone receptors. Bioinformatics 2004.
References                                                      18. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H,
1. Barker J, Thornton J: An algorithm for constraint                Jimeno A: Text processing through Web services:
   based structural template matching: application                  Calling Whatizit. Bioinformatics 2008.
   to 3D templates. Bioinformatics 2003.
                                                                19. Pezik P, Jimeno A, Lee V, Rebholz-Schuhmann D: Static
2. Oldfield T: Data Mining the Protien Data Bank:
                                                                    dictionary features for term polysemy identifica-
   Residue Interactions. Proteins 2002.
                                                                    tion. Building and evaluating resources for biomedical
3. Nebel J, Herzyk P, Gilbert D: Automatic generation               text mining, LREC Workshop 2008.
   of 3D motifs for classification of protein binding
   sites. BMC Bioinformatics 2007.                              20. Tsuruoka Y, Mcnaught J, Ananiadou S: Normalizing
                                                                    biomedical terms by minimizing ambiguity and
4. Kristensen D, Ward M, Lisewski A, Erdin S, Chen B,               variability. BMC Bioinformatics 2008, 9.
   Fofanov V, Kimmel M, Kavraki L, Lichtarge O: Predic-
   tion of enzyme function based on 3D templates of             21. Krauthammer M, Nenadic G: Term identification in
   evolutionarily important amino acids. BMC Bioin-                 the biomedical literature. J Biomed Inform 2004.
   formatics 2008.                                              22. Leroy G, Chen H, Martinez J: A shallow parser
5. Polacco B, Babbitt P: Automated discovery of 3D                  based on closed-class words to capture relations
   motifs for protein function annotation. Bioinfor-                in biomedical text. J Biomed Inform 2002.
   matics 2006.                                                 23. Schuman J, Bergler S: Postnominal Prepositional
6. Yoon S, Ebert J, Chung E, DeMicheli G, Altman R:                 Phrase Attachment in Proteomics. In Proceedings of
   Clustering protein environments for function pre-                the HLT-NAACL BioNLP Workshop on Linking Natu-
   diction: finding PROSITE motifs in 3D. BMC                       ral Language and Biology, Association for Computational
   Bioinformatics 2007.                                             Linguistics 2006.
7. Stark A, Sunyaev S, Russell R: A model for statistical       24. Cerbah F: Exogeneous and endogeneous ap-
   significance of local similarities in structure. J Mol           proaches to semantic categorization of unknown
   Biol 2003.                                                       technical terms. COLING 2000.


25. Gaizauskas R, Demetriou G, Artymiuk P, Willett P: Protein structures and information extraction
    from biological texts: the PASTA system. Bioinformatics 2003.
26. Ashburner M, Lewis S: On ontologies for biologists: the Gene Ontology - uncoupling the web.
    Novartis Found Symp 2002.
27. Bairoch A: The ENZYME database in 2000. NAR 2000.



Figures
Figure 1: Comparison of UniProt indexed citations and discovered citations from Medline.
For a subset of all UniProt proteins, the extraction system identified the triple associations of
organism-protein-residue in Medline abstract texts. The list of citations identified for these proteins
was compared with the citation references from the corresponding UniProt entries.

[Figure 1 image: chart titled "Citation sets of UniProt proteins". For each protein (KIT_MOUSE,
NFKB1_MOUSE, BARD1_HUMAN, THIO_HUMAN, VPS36_HUMAN, ETXB_STAAU, Q7LZK5_BITAR, VPS27_YEAST,
GP1BA_HUMAN, TGFA_HUMAN, XRCC1_HUMAN, LCK_HUMAN, FINC_HUMAN, HFE_HUMAN, TFR1_HUMAN, ASPP2_HUMAN)
the chart shows the UC citations, the common citations and the RC citations on an axis running
from 0 to 800.]




Tables
Table 1: Six target categories of biological interest (SCAT).
The definitions of the categories of biological interest targeted in this study are listed together with the
databases referenced for extracting candidate terminologies. A mapping of these categories to equivalent
or similar categories from UniProtKB (FCAT) is provided.




     SCAT                      FCAT                        reference           definition
     structure component       domain, motif,              PASTA [25]          Class denoting concepts that
                               topo dom, chain,                                represent pieces and parts of
                               transmem, coil                                  the protein structure.
     chemical modification     variant, mod res,           n/a                 Class denoting changes to the
                               peptide, var seq,                               protein sequence and the
                               lipid                                           chemical composition.
     structural modification   region, site                n/a                 Class denoting the changes to
                                                                               the protein structure without
                                                                               changes to the chemical
                                                                               composition.
     binding type              binding, metal,             GO [26]             Class denoting different
                               disulfid, crosslnk,                             physico-chemical forces leading
                               dna bind, np bind,                              to a bond formation between a
                               zn fing, ca bind                                protein structure component and
                                                                               a chemical entity.
     enzymatic activity        act site                    EC [27], GO [26]    Types of enzymatic reactions as
                                                                               a subpart to protein functions.
     cellular phenotype        n/a                         n/a                 Class denoting different cellular
                                                                               phenotypes that can be affected
                                                                               by structural or compositional
                                                                               changes of a protein.


Table 2: Named entity recognition and association detection performance evaluated on gold standard
corpus.
Performance was measured in terms of precision, recall, and F1 measure. o = organism; p = protein; r =
residue; o-p-r = association of o, p and r.

     target   available   extracted   TP       precision       recall    F1
     o              123         109    88           0.81        0.72    0.76
     p              511         471   305           0.65        0.60    0.62
     r              202         222   197           0.87        0.96    0.91

     o-p-r          158         63        52       0.83         0.33    0.47
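
As a reading aid, the scores in the table follow from the counts through the standard definitions: precision = TP / extracted, recall = TP / available, and F1 is the harmonic mean of the two. A minimal Python sketch is given below; the function name `prf` and the rounding to two decimals are illustrative assumptions and not part of the described system.

```python
def prf(true_positives, extracted, available):
    """Return (precision, recall, F1), each rounded to two decimals.

    precision = TP / extracted, recall = TP / available,
    F1 = harmonic mean of precision and recall.
    """
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / available if available else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return round(precision, 2), round(recall, 2), round(f1, 2)


# Counts taken from the "o" and "o-p-r" rows of the table.
organism = prf(true_positives=88, extracted=109, available=123)
association = prf(true_positives=52, extracted=63, available=158)
print(organism, association)
```

Applied to the o and o-p-r rows, this reproduces the reported values (0.81, 0.72, 0.76) and (0.83, 0.33, 0.47).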


Table 3: Cross-validation of organism-protein-residue extraction with UniProtKB.
Automatic performance analysis of the extraction with UniProtKB as reference. Performance was measured
in terms of precision, recall, and F1 measure. UC = UniProt corpus; XC = cross-validation corpus; UTP
= triplet identifiers of UniProtID-TaxonomyID-PMID; URP = triplet identifiers of UniProtID-ResidueID-PMID;
o = organism; p = protein; r = residue; o-p = association of o and p; p-r = association of p and r.




                                                                                                    ResID
     data     o    p   r     o-p     p-r    PMID       TaxID     UniProtID        conv       site       seq   range—pair
     UC                                    136,559     11,348      175,695      28,950    33,750      4,021        2,281
     UC        -   -   -                     6,532          0            0           0
     UC       +                            119,880     11,348      174,717      25,482    30,041      3,740        2,095
     UC            +                       129,792     11,328      175,695      28,932    33,723      3,991        2,278
     UC                +                    30,732      4,743      115,882      28,950    33,750      4,021        2,281
     UC       +    +                       119,653     11,328      174,717      25,470    30,014      3,713        2,092
     UC       +    +   +                    27,709      4,740      113,412      25,470    30,014      3,713        2,092

     XC                                      5,253      1,536        45,869      9,519     7,342       227           421
     XC        -   -   -                   131,306          0             0
     XC       +                              5,253      1,536        45,869      9,519     7,342       227           421
     XC            +                         5,253      1,536        45,869      9,519     7,342       227           421
     XC                +                     5,253      1,536        45,869      9,519     7,342       227           421
     XC       +    +                         5,253      1,536        45,869      9,519     7,342       227           421
     XC       +    +   +                     5,253      1,536        45,869      9,519     7,342       227           421
     XC       +    +          +              5,253      1,536        45,869      9,519     7,342       227           421
     XC       +    +   +      +              5,253      1,536        45,869      9,519     7,342       227           421
     XC       +    +   +      +       +      4,506      1,301         3,937      8,804     5,783         0           329

                                                                    UTP
  data    o    p   r   o-p    p-r     available    extracted     common       precision   recall      F1
  XC      +    +        +               70,401         7,333        5,625          0.77    0.08      0.14

                                                                    URP
  data    o    p   r   o-p     p-r     available     extracted   common       precision   recall      F1
  XC      +    +   +    +       +        68,008          9,504     9,504           1.00    0.14      0.25
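
The UTP and URP comparisons can be read as a set operation on identifier triplets: the "common" column is the intersection of the triplets extracted from text with those available in UniProtKB. The Python sketch below illustrates this reading with made-up identifiers; the function name `cross_validate`, the `Scores` record and the toy data are assumptions for illustration and do not reproduce the system's actual implementation.

```python
from typing import NamedTuple, Set, Tuple

# A triplet such as (UniProtID, ResidueID, PMID) or (UniProtID, TaxonomyID, PMID).
Triplet = Tuple[str, str, str]


class Scores(NamedTuple):
    available: int
    extracted: int
    common: int
    precision: float
    recall: float
    f1: float


def cross_validate(extracted: Set[Triplet], reference: Set[Triplet]) -> Scores:
    """Compare extracted triplets against the reference triplets."""
    common = extracted & reference  # triplets found in both sources
    precision = len(common) / len(extracted) if extracted else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(len(reference), len(extracted), len(common), precision, recall, f1)


# Toy data with hypothetical identifiers (not taken from UniProtKB).
reference = {
    ("P00001", "SER19", "111"),
    ("P00001", "THR13", "111"),
    ("P00002", "ASP123", "222"),
    ("P00003", "HIS279", "333"),
}
extracted = {
    ("P00001", "SER19", "111"),
    ("P00002", "ASP123", "222"),
    ("P00009", "GLY43", "999"),  # not in the reference: lowers precision
}

scores = cross_validate(extracted, reference)
print(scores)
```

Under this reading, a low recall combined with a high precision, as in the URP row, means that most extracted triplets are confirmed by the database while most database triplets are not recovered from the abstracts.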


Table 4: Feature classification performance.
The classification of contextual features of residues mentioned in text was used to identify annotations and
to classify them into categories of biological interest. Cross validation was performed with training and test
sets with 3600 and 400 features, respectively. Performance was measured in terms of precision, recall and
F1 measure.




                                SCAT                                                          FCAT
     category                      recall       precision     F1          category           recall precision   F1
     structure component           0.8          0.6           0.69        motif              0.45   1           0.62
                                                                          domain             0.5    0.62        0.55

     chemical modification          0.73        0.52          0.61        variant            0.77    0.5        0.61
                                                                          lipid              0.4     1          0.57
                                                                          modified res       0.47    0.59       0.52
                                                                          peptide            0.11    0.29       0.16

     binding type                   0.68        0.67          0.67        binding            0.63    0.54       0.58
                                                                          crosslink          0.25    0.67       0.36
                                                                          disulfid           0.17    0.62       0.26
                                                                          metal              0.12    0.25       0.16

     structural modification        0.25        0.64          0.36        site               0.68    0.47       0.56
                                                                          region             0.59    0.46       0.52
     enzymatic activity             0.42        0.49          0.46        active site        0.48    0.5        0.49

     cellular phenotype             0.47        0.6           0.53        n/a


Table 5: Feature detection evaluated on gold standard corpus.
The classification method was used to identify features of interest. The performance in detecting manually
determined annotations was measured in terms of precision, recall and F1 measure. SCAT = feature detection
using the six target categories; FCAT = feature detection using categories from the feature table in UniProtKB.

     feature       available   extracted        common       precision     recall      F1
     SCAT                164         474           100            0.21      0.61      0.31
     FCAT                164         460            97            0.21      0.59      0.31


Table 6: Performance of residue-feature association detection evaluated on gold standard corpus.
The association of residue and annotation was done by shallow parsing and extracting verbal/prepositional
relations. The performance was measured in precision, recall and F1 measure. GC = gold standard corpus;
r = residue; f = feature; o = organism; p = protein; s = verbal/prepositional relation between r and f; o-p
= association between o and p; p-r = association between p and r.

                          extraction filter
     data      s     r     f     o     p     o-p     p-r     avail    extr    comm     prc     rec     f1
     GC        +                                                88     132      68    0.52    0.77   0.62
     GC        +     +     +                                    88      65      30    0.46    0.34   0.39
     GC        +     +     +     +     +                        82      62      27    0.44    0.33   0.38
     GC        +     +     +     +     +      +       +         82      93      19    0.20    0.23   0.22


Table 7: Comparison of extracted protein residue annotations with UniProtKB.
The extraction system delivered protein annotations from Medline abstracts. The examples shown were
drawn from the extraction results on the gold standard corpus.



UniProtID   ResidueID   PMID       SCAT        extracted feature                      FCAT       UniProt annotation
P40380      THR13       12135491   str comp    major phosphorylation sites for MAPK   mod res    Phosphothreonine; by MAPK
"           SER19       "          "           "                                      "          Phosphoserine; by MAPK
"           SER19       12135491   chem mod    negative effect                        mutagen    S → E: reduces activity as a cdc2
                                                                                                 inhibitor; when associated with E-13

Q93K00      ASP123      12147465   enzymatic   the putative catalytic triad           act site   nucleophile (by similarity)
"           HIS279      "          "           "                                      "          proton acceptor (by similarity)
"           ASP250      "          "           "                                      "          proton donor (by similarity)

Q93K00      GLU55       12147465   str comp    putative oxyanion hole                 n/a        n/a
"           TRP124      "          "           "                                      n/a        n/a

Q02809      W612        9617436    str comp    "                                      n/a        n/a

Q9HAB8      GLY43       12906824   str comp    conserved ATP binding residues         n/a        n/a
"           SER61       "          "           "                                      n/a        n/a
"           GLY63       "          "           "                                      n/a        n/a
"           GLY66       "          "           "                                      n/a        n/a
"           PHE230      "          "           "                                      n/a        n/a
"           ASN258      "          "           "                                      n/a        n/a
"           ASN59       "          "           conserved phosphopantothenate binding  n/a        n/a
"           ALA179      "          "           "                                      n/a        n/a
"           ALA180      "          "           "                                      n/a        n/a
"           ASP183      "          "           "                                      n/a        n/a



