=Paper=
{{Paper
|id=Vol-429/paper-4
|storemode=property
|title=Annotation of protein residues based on a literature analysis: cross-validation against UniProtKB
|pdfUrl=https://ceur-ws.org/Vol-429/paper4.pdf
|volume=Vol-429
|dblpUrl=https://dblp.org/rec/conf/eccb/NagelJR08
}}
==Annotation of protein residues based on a literature analysis: cross-validation against UniProtKB==
Kevin Nagel∗1, Antonio Jimeno1, Tom Oldfield1 and Dietrich Rebholz-Schuhmann1
1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Email: Kevin Nagel∗ - auyeung@ebi.ac.uk;
∗ Corresponding author
Abstract

Background: A protein annotation database, such as the Universal Protein Resource (UniProtKB), is a valuable resource for the validation and interpretation of predicted 3D structure patterns in proteins. Previously, results have been reported on methods for extracting point mutations from biomedical literature, which can be used to support the time-consuming work of manual database curation. However, these methods were limited to point mutation extraction and do not extract features for the annotation of proteins at the residue level.

Results: This work introduces a system that identifies protein residue sites in abstract texts and annotates them with features extracted from the context. The performance of all text mining modules was evaluated against a manually annotated corpus. The identified annotation features can be attributed to at least one of six targeted categories, e.g. enzymatic reaction. Extracted results were cross-validated against UniProtKB, and a manual assessment was performed for 13 residue annotations that have not been confirmed in UniProtKB.

Conclusions: This work proposes a solution for the automatic extraction of protein residue annotation from biomedical articles. The presented approach extends existing systems in that a wider range of residue entities is considered and features of residues are extracted as annotations.
Background

Understanding the biological function of proteins remains a central challenge in biology. In protein science, sequence analysis of amino acids and studies of their spatial distribution have led to predictions and discoveries of a number of biologically significant patterns and motifs, e.g. metal-binding sites, catalytic triads, and ligand binding sites [1–7]. Complementary to these mined data is the proliferation of protein annotations obtained by extracting information from biomedical articles with a view to updating existing databases. Clearly, annotations can be used to verify data-mined sequence/structure patterns, and likewise predicted patterns can be used to search for associations in the database. However, the major annotation effort at the current stage is the compilation of features at the protein level, while the actual target should be the residue level, because biological functions can be mapped to a defined group of residues in proteins (function sites). This is also reflected in the field of automatic information extraction from literature, where solutions have been published for the extraction of protein interactions [8, 9], subcellular protein localisation [10], pathway discovery [11], and function annotation with Gene Ontology terminologies [12]. Few groups have investigated point mutation extraction, and those that did performed no feature extraction for residue annotation [13–17].

Several works have focused on the extraction of point mutations, which are one type of residue entity [13–17]. The point mutation extraction systems MEMA [16] and MuteXt [17] use a dictionary lookup approach to detect protein names and disambiguate multiple protein-residue pairs with a word distance measurement. MutationGraB [13], the successor of MuteXt, uses a graph bigram method to calculate proximity by weighting the association of word pairs. Another application, MutationMiner [15], focuses on the integration of extracted point mutations into a protein structure visualisation program.

These systems are all dedicated to the extraction of point mutations, but provide no extraction of residue annotation. In a recent publication [14], an ontological model was proposed that should hold information extracted by MutationMiner as well as point mutation annotations. However, the authors neither provided results of feature extraction nor proposed a strategy for it. Residue annotation differs from functional annotation of proteins because the biological role of a residue is described in a biochemical context, which is then revealed in the function or property of the protein. At present, neither such an ontological model nor a terminological resource is publicly available.

The goal of this research is the identification of the biological function of mined structure patterns of proteins. For this purpose a novel approach that combines structure mining and text mining is proposed. The results of the combined mining study will be published elsewhere. This paper reports on the text mining part and introduces a strategy for the compilation of protein residue annotations that can be used for the interpretation of structure patterns. The results demonstrate that textual information can be captured and used to augment data in UniProtKB. Because the primary data resource is Medline, the extraction covers a broad range of biomedical fields, but is limited to abstract texts. The biological community benefits from the extracted annotations, for example, in that data-mined structure patterns can be interpreted biologically or predicted functions in proteins can be better characterised.

The contribution of this work is the automatic extraction of protein residue annotation from biomedical articles. Contextual information is exploited to identify features of residues that correspond to one of six chosen target categories (SCAT, Table 1). As a result, proteins can be selected with residues clustered by annotation types, which can lead to the discovery of, for example, evolutionary relationships.

Results and Discussion

The following sections assess first the extraction system and then the extracted data.

Evaluation of the identification systems for mentions of organism, protein and residues and their associations.

In order to evaluate the performance of the NER and the AD systems used in this study, the results were compared against the results from manual curation of a set of 100 Medline articles, i.e. the gold standard corpus (GC) generated as part of this study.

Table 2 (top) shows the performance of each named entity recognition module. With an F1 measure of 0.91, the performance of the residue tagger is within the range of previous works in which the residue was identified only as a point mutation [13–17]. The performance of organism name recognition was lower, with a precision of 0.81 and a recall of 0.72. Protein recognition has the lowest performance (precision = 0.65, recall = 0.60). The relatively low recall is due to permutations and lexical variants in the text that are not covered by the dictionaries.

The evaluation of the organism-protein-residue AD module shows that the algorithm of [17] is suitable for association detection. It reaches a precision of 0.83 and a recall of 0.33 (Table 2, bottom). Two prominent reasons for the low recall are that the organism-protein association is correct but the residue does not match the protein sequence, or that the association of organism and protein was wrong in the first instance.

The implemented association detection system is able to extract associations in accordance with UniProtKB.
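As a small worked illustration of the measures used in this evaluation, the following sketch computes the F1 measure (harmonic mean of precision and recall) from the precision and recall values quoted in the text. The function name `f1_score` and the dictionary layout are illustrative assumptions; the printed F1 values are derived here and are not quoted from Table 2.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as quoted in the text for the gold standard corpus.
# The labels are illustrative; only the numbers come from the paper.
reported = {
    "organism NER": (0.81, 0.72),
    "protein NER": (0.65, 0.60),
    "organism-protein-residue AD": (0.83, 0.33),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

The low F1 of the association detection step follows directly from its low recall: the harmonic mean is dominated by the smaller of the two values.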
Cross-validation of organism-protein association with UniProtKB.

In this section the evaluation was performed automatically on a cross-validation test set (XC) derived from the UniProt corpus (UC). Of the 136,566 citations listed in UniProt, a virtually complete set of 136,559 abstract texts was retrieved from Medline to build the UC. Subselection from UC to determine XC resulted in 5,253 abstract texts representing a range of diverse proteins (Table 3, top). Corresponding to this test corpus are the set of 70,401 triplet identifiers of UniProtID-TaxonomyID-PMID (UTP) for the protein-organism association evaluation and the set of 68,008 triplet identifiers of UniProtID-ResidueID-PMID (URP) for the protein-residue association (Table 3, middle and bottom).

With a precision of 0.77 and a recall of 0.08 (F1 = 0.14), the result for organism-protein association extraction indicates that, although the system seems to extract correct relations with a reasonable number of TP, the recall is too low to fully judge the performance. The low recall is best explained by missing information in the scientific documents that would confirm the organism-protein association. The results show that the stringent residue-sequence match resulted in a precision of 1.00 and a recall of 0.14 (F1 = 0.25). The low recall can be explained by several factors: 1) differences in protein sequence indexing between the author and the database; 2) changes in the sequence indexing rules of UniProtKB; 3) sequence variants which have not yet been reported in the database; 4) false protein-organism associations, with the consequence of retrieving the incorrect sequence.

Note that the evaluation of the extraction system was done on Medline abstracts for a range of diverse proteins indexed by UniProtKB, as opposed to previous works with extraction from full texts for a few protein family examples. The results therefore imply that extraction from abstract texts alone is possible for a number of different UniProt proteins.

PDB citation enrichment.

For each PDB protein entry a link to a corresponding UniProt record is available. The AD system extracts relations only for proteins recorded in UniProtKB. Therefore each Medline record with a found o-p-r association can be added to the citation set of the corresponding PDB entry. At the time of this analysis, the PDB contained 42,943 protein structures, with a sub-fraction of 42,653 having a unique corresponding UniProt protein identifier (11,912). For each of these proteins the whole of Medline was scanned for abstracts with extracted organism-protein-residue associations. Figure 1 shows the comparison of the citation sets based on UniProtKB references and on the whole Medline analysis.

For 2,535 out of 11,912 proteins the extraction system found a total of 18,748 corresponding PMIDs. Analysis with citation indices for this subset of proteins revealed that 680 out of 18,748 PMIDs were rediscoveries. The low number of rediscoveries can be explained by the fact that many annotations are made from sections only available in the full text. Although the analysis was based on Medline abstract texts, the extraction was already able to find a large number of citations for 21 percent of the target proteins. With a precision of 0.83 (determined by gold standard evaluation), the estimated number of TP among the newly discovered citations is 15,560. In the context of the 16,560 references of the 2,535 proteins from UniProtKB, the extraction expands the citation set 1.94-fold.

The extraction system can be used to expand the citation list of UniProtKB/PDB using only Medline abstract texts. In this experiment the estimated number of overlooked citations for a subset of target proteins already provides a large set for feature extraction for the annotation of protein residues.

Evaluation of feature extraction.

The detection of domain-specific features was done by a classification approach which required a labelled reference set and a defined set of categories. The precision, recall and F1-measure values were calculated for each category and are summarised in Table 4. Two sets of categories were tested, each with different but corresponding semantic categories: (1) the six targeted categories (SCAT) and (2) the categories listed in the feature table in UniProtKB (FCAT).

For SCAT, the classifiers for structure component, chemical modification, and binding type yielded F1 measures of 0.69, 0.61, and 0.67. For FCAT the top performing classifiers were motif, variant, and binding, with similar F1 scores (0.62, 0.61, 0.58). The remaining classifiers are still usable for feature detection, as they had precision scores comparable to those of the classifiers with the top F1 scores: enzymatic activity and cellular phenotype from SCAT, and modified residue, active site and site from FCAT. The figures indicate that the features used here are suitable for feature detection and classification. The performance of feature detection was tested on the gold standard corpus (GC). Sentences with residue mentions were examined and, where applicable, suitable features were annotated manually and compared with the output of the extraction method. The number of validated and non-validated features was determined and the performance measured.

The results show that the classification approach for feature detection had reasonable coverage (recall of 0.61 for SCAT and 0.59 for FCAT) but is imprecise in capturing the correct annotation (precision of 0.21 for both, Table 5). This is not surprising, considering that features are expressed throughout whole sentences but have different attachments to named entities.

The association of residues and features was based on a syntactic analysis of their verbal and prepositional relations using a shallow language parser. The approach was evaluated by its performance in detecting all manually annotated residue-feature pairs within the GC data set. With a precision of 0.54 and a recall of 0.81, the performance of the shallow parser suggests it is highly usable for residue annotation extraction (Table 6). The low precision is explained by the current implementation of the parser, which returns relations with nested prepositional phrases, so that the calculated precision tends to be lower. Extraction performance decreases when additional extraction modules (NER, AD, FE) are used. This shows that the extraction of annotations is highly sensitive to each extraction module.

Although the performance of each module can be improved, the results show that the extraction system can deliver residue annotations.

Protein residue annotation extraction and comparison with UniProtKB.

The extraction system in this study delivered classified features of protein residues from Medline as annotations. This section provides examples of the validity of the derived annotations by comparing information extracted from the gold standard corpus with entries in UniProtKB.

Within this experiment, four UniProt proteins with a total of 19 annotations from seven sentences and five abstract texts were mined with the extraction system (Table 7). Comparison of the mined annotations with the corresponding entries in UniProt showed that six out of 19 annotations were equivalent to existing information in the database (rediscovery). Further, the semantic tags of the annotations, provided by the classification of extracted text features, are biologically meaningful. For example, “the putative catalytic triad” is correctly tagged as enzymatic, because it is a chemical reaction site and therefore a requirement for enzymatic function. In this example, the predicted semantic tag is equivalent to the category active site from the feature table in UniProt. In another example, “major phosphorylation sites” was evaluated as a rediscovery of the database information “Phosphothreonine; by MAPK” and “Phosphoserine; by MAPK”, while the predicted tag (structural component) and the assigned category in UniProt (modified residues) are not equivalent. This is still valid, because both pieces of information describe the function of the residues as a modification site; the predicted tag represents this as a substructure, whereas UniProt emphasises the modification of the residues.

For the remaining 13 extracted annotations there is no equivalent information in UniProt. All are tagged with structural component, which is biologically valid: for example, “highly conserved C-terminal region” is an important substructure of the protein, and the extraction can aid in determining evolutionarily important residues of protein families. However, it is debatable whether the annotation “conserved phosphopantothenate binding” should be tagged as structural component or as binding.

In conclusion, the biological significance of the extracted annotations was studied by comparison with annotations from UniProt for the proteins extracted from the gold standard corpus. The rediscovery data show that the SCAT scheme and its feature sets are able to capture information corresponding to UniProt annotations. The predicted semantic tags are biologically valid and do not necessarily have to be equivalent to the categories found in the database. On the other hand, the novel discovery data indicate a potential contribution of the extraction to the automatic annotation of protein residues in UniProt.
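The citation enrichment figures reported under "PDB citation enrichment" can be retraced with a few lines of arithmetic. The sketch below is an interpretation rather than the authors' procedure: it assumes that the estimate of 15,560 true positives was obtained by applying the gold standard precision (0.83) to all 18,748 extracted PMIDs, and that the fold expansion relates the enlarged citation set to the 16,560 existing UniProtKB references. The function and parameter names are hypothetical.

```python
def citation_enrichment(proteins_total, proteins_with_hits,
                        extracted_pmids, precision, uniprot_refs):
    """Retrace the enrichment figures from the counts quoted in the text.

    Assumption (not stated explicitly in the paper): the precision is
    applied to all extracted PMIDs, including rediscoveries.
    """
    coverage = proteins_with_hits / proteins_total        # share of proteins with hits
    estimated_tp = int(extracted_pmids * precision)       # estimated correct citations
    fold = (uniprot_refs + estimated_tp) / uniprot_refs   # expansion of the citation set
    return coverage, estimated_tp, fold


coverage, estimated_tp, fold = citation_enrichment(
    proteins_total=11_912,
    proteins_with_hits=2_535,
    extracted_pmids=18_748,
    precision=0.83,
    uniprot_refs=16_560,
)
print(f"coverage = {coverage:.0%}, estimated TP = {estimated_tp}, fold = {fold:.2f}")
```

Under these assumptions the sketch reproduces the 21 percent coverage, the 15,560 estimated true positives and the 1.94-fold expansion quoted in the text.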
Conclusions

The aim of this work was to compile protein residue features from Medline texts as annotations for UniProtKB proteins by combining a series of text mining methods. Although the performance of each module may not be at an optimal level, the generated output indicates that the strategy is able to deliver biologically meaningful results. Cross-validation against UniProtKB indicates that the extraction contains novel information that can complement and update the knowledge in UniProtKB and consequently provide annotations for PDB protein structures.

It is important to note that the extraction was done only on abstract texts from Medline. The advantage over full text is access to a publicly available, broad range of scientific publications, but at the cost of the more limited information content of abstract texts. However, the results demonstrate that even with abstract texts a vast amount of annotation can be obtained.

As high-performing NER, AD, and FE systems become more available, this conceptual strategy for protein residue annotation extraction may yield optimal results for the biological community.

Methods

The extraction of protein residue annotation from text can be divided into three steps: 1) named entity recognition (NER) and extraction of residue mentions, 2) association detection (AD) of related named entities, 3) extraction of annotation features for associated entities.

NER for protein and species.

Named entity recognition for proteins was based on an approach that combined dictionary lookup with fuzzy matching and basic disambiguation [18–20]. All protein names were collected from UniProtKB/SwissProt. Names of species were obtained by extracting the NCBI Taxonomy references from UniProtKB/SwissProt and then collecting the scientific and common names of the referenced organisms. The dictionary was complemented with terms describing only the referenced genus, and the collection of full organism names (genus + species) was augmented with abbreviated genus forms (first letter abbreviation of genus + species). Web services for the identification of protein names and taxa names are available from the TM infrastructure at the EBI [18].

Identification of residue mentions from the text.

The extraction of residue mentions follows the approaches of previous publications [16, 17]. Sets of regular expressions were constructed to identify three types of protein residue site mentions. The first, basic type is the single protein sequence site reference, which consists of a (wild-type) amino acid name followed by the sequence position number (e.g. “Gly-12”, “arginine 4”, “Tyr74”, “Arg(53)”). A point mutation is the second type of residue site, where the description details the change of an amino acid at a given position. The common notation is the wild-type amino acid name and the sequence position, followed by the substitution (e.g. “W77R”, “Cys560Arg”, “ser-52->ala”, “ala2-methionine”). Finally, the third type of residue site describes either a list of residues or an interaction pair (e.g. “Tyr 85 to Ser 85”, “Trp27–Cys29”). The common notation is an amino acid name, sequence position, a connection symbol or connection word, amino acid name, and sequence position. In addition to the abbreviated notation, residue sites can be expressed in grammatical form (e.g. “isoleucine at position 3”, “substitution of Ala at position 4 to Gly”, “Ser472 to glutamic acid”).

Identification of associations between mentions of species, proteins and residues.

The identification of a residue can only be validated if it is part of the protein sequence as reported in a reference database (e.g., UniProtKB). This requires that the protein mention in the text is further supported by evidence for the species under scrutiny, in order to select the appropriate protein sequence from the bioinformatics database; this excludes the risk of using orthologous protein sequences. The association of organisms with proteins and of proteins with residues was based on the algorithm described by [17]. First, species and protein mentions were associated by measuring the word distance between them. An associated protein and its species mention form a pair that specifies the protein with a unique identifier in the reference database (UniProtKB). If no match was found, the association was relaxed to genus matching, resulting in a list of protein identifiers. If multiple organisms matched, a word proximity metric was used to prefer the closest word pair. The identifier was used to retrieve the protein sequence from the database in order to validate the residue mention. According to the algorithm proposed by [17], three cases can be distinguished: (1) the residue correctly matches the protein sequence, (2) several alternative sequences from a list of protein mentions (identifiers) match, and (3) no match can be found for the residue in the available protein sequences. If several protein sequences were relevant candidates, the word distance metric was again used to select the closest word pairs.

Feature extraction for the annotation of residues.

The biological function of a protein originates from a group of residues, and their experimental characterisation is reported in scientific publications. In this study the feature extraction process was divided into two parts: in the first part the text was processed to extract noun phrases (NPs) that served as candidate features, and in the second part the extracted candidate features were classified into categories of annotation features. Noun phrases are specified as nominal forms in combination with adjective and adverb mentions (NP = Det? (Adj|Adv|N)* N). Even though most NPs denote terms, this is not always true [21].

In the first part, the abstract text was split into sentences and annotated with part-of-speech (pos) tags using the cistagger, which has a performance similar to the treetagger but integrates a large biomedical terminological resource. Then the shallow parser described in [22] was applied to extract verbal and prepositional dependencies. Since this parser does not deal with prepositional attachment ambiguity, it has been extended with the prepositional phrase attachment disambiguation module explained in [23]. In the second part, the features were categorized using the endogenous classification approach described in [24]. Basically, the algorithm relies only on the mutual information of the lexical constituents of terms and their assigned categories. In contrast, the exogenous (corpus-based) approach requires large amounts of contextual cues, which are difficult to obtain. The endogenous approach is therefore more likely to produce results even under conditions of sparse data. During the training phase, lexical constituents of multi-word terms were extracted from a labelled reference set and represent features for a defined set of categories. The associations between the features and the categories were estimated based on their mutual information score, and the association between a multi-word term and a category was computed as the sum of the associations of its constituents. The categorization of a multi-word term into one of the categories then amounts to identifying the best fitting category for a term based on the term’s components. The reference set for the relevant multi-word terms was generated using maximal length noun phrase (MLNP) analysis based on two different sets of NPs that were extracted from an analysis of all Medline abstract texts: the first set consists of NPs that co-occurred with residue mentions in the same sentence without nested residue terms (NP(not r)), and the second set represents NPs with nested residue terms (NP(r)), since the co-occurrence with a residue may indicate higher relevance. Once the set of MLNPs was extracted, each NP was manually labelled using three different categorization schemes. The first scheme is binary labelling (BCAT) to separate domain-relevant terms from non-relevant ones. The second scheme uses six semantic categories identified from a study on the manual categorization of residue annotations based on scientific content from Medline (bottom-up approach). The identified categories and their definitions are shown in Table 1 (SCAT). The final set was defined through a top-down approach by reusing categories described in the feature table of the UniProtKB data resource for proteins (FCAT).

Generation of evaluation corpora.

For the evaluation of the extraction system, two test corpora were generated using the UniProt corpus (UC). The UC consists of those Medline abstract texts that are cited in the UniProt database for relevant protein-residue pairs. The complete corpus was automatically analysed for organism, protein and residue mentions and tagged appropriately. A gold standard corpus (GC) was created through manual curation, since no such corpora are available. A random sample of 100 Medline abstract texts was drawn from the UC, where every abstract had to fulfil the condition that a mention of an organism, a protein and a residue was present (tri-co-occurrence). All mentions of an organism, a protein and a residue, the associations between the mentions, and the contained features of the residues (see above) were then annotated manually by two independent annotators with domain expertise. For the automatic evaluation
of extracted data a cross-validation corpus (XC) was derived from UC, because not all database information is necessarily expressed in abstract texts and vice versa. Documents in UC were scanned for tri-occurrences of organism-protein-residue mentions in the text, and then analysed to determine whether the combinations of the four identifiers UniProtID-TaxonomyID-ResidueID-PMID can be found in the database. If at least a single match was found, the document was selected. For the non-matching combinations the corresponding annotations were removed from the text.

Authors contributions

Kevin Nagel carried out the experiments, developed and implemented the methods, assessed the annotations, and drafted the manuscript. Antonio Jimeno participated in the development of the methods and drafted the manuscript. Dietrich Rebholz-Schuhmann participated in the design of the experiments, assessed the annotations and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Kim Henrick, Michael Ashburner and Rob Russell for their input in this project.

References

1. Barker J, Thornton J: An algorithm for constraint based structural template matching: application to 3D templates. Bioinformatics 2003.

2. Oldfield T: Data Mining the Protein Data Bank: Residue Interactions. Proteins 2002.

3. Nebel J, Herzyk P, Gilbert D: Automatic generation of 3D motifs for classification of protein binding sites. BMC Bioinformatics 2007.

4. Kristensen D, Ward M, Lisewski A, Erdin S, Chen B, Fofanov V, Kimmel M, Kavraki L, Lichtarge O: Prediction of enzyme function based on 3D templates of evolutionarily important amino acids. BMC Bioinformatics 2008.

5. Polacco B, Babbitt P: Automated discovery of 3D motifs for protein function annotation. Bioinformatics 2006.

6. Yoon S, Ebert J, Chung E, DeMicheli G, Altman R: Clustering protein environments for function prediction: finding PROSITE motifs in 3D. BMC Bioinformatics 2007.

7. Stark A, Sunyaev S, Russell R: A model for statistical significance of local similarities in structure. J Mol Biol 2003.

8. Marcotte E, Xenarios I, Eisenberg D: Mining literature for protein-protein interactions. Bioinformatics 2001.

9. Blaschke C, Andrade M, Ouzounis C, Valencia A: Automatic extraction of biological information from scientific text: Protein-protein interactions. Proc Int Conf Intell Syst Mol Biol 1999.

10. Stapley B, Kelley L, Sternberg M: Predicting the sub-cellular location of proteins from text using support vector machines. Pac Symp Biocomput 2002.

11. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A: GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001.

12. Blaschke C, Leon EA, Krallinger M, Valencia A: Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics 2005.

13. Lee L, Horn F, Cohen F: Automatic extraction of protein point mutations using a graph bigram association. PLoS Computational Biology 2007.

14. Witte R, Kappler T: Enhanced semantic access to the protein engineering literature using ontologies populated by text mining. Int. J. Bioinformatics Research and Applications 2007.

15. Baker C, Witte R: Mutation Miner - Textual Annotation of Protein Structures. CERMM Symposium 2005.

16. Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, Kirsch H: Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucl. Acids Res. 2004.

17. Horn F, Lau A, Cohen F: Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 2004.

18. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: Calling Whatizit. Bioinformatics 2008.

19. Pezik P, Jimeno A, Lee V, Rebholz-Schuhmann D: Static dictionary features for term polysemy identification. Building and evaluating resources for biomedical text mining, LREC Workshop 2008.

20. Tsuruoka Y, Mcnaught J, Ananiadou S: Normalizing biomedical terms by minimizing ambiguity and variability. BMC Bioinformatics 2008, 9.

21. Krauthammer M, Nenadic G: Term identification in the biomedical literature. J Biomed Inform 2004.

22. Leroy G, Chen H, Martinez J: A shallow parser based on closed-class words to capture relations in biomedical text. J Biomed Inform 2002.

23. Schuman J, Bergler S: Postnominal Prepositional Phrase Attachment in Proteomics. In Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology, Association for Computational Linguistics 2006.

24. Cerbah F: Exogeneous and endogeneous approaches to semantic categorization of unknown technical terms. COLING 2000.

25. Gaizauskas R, Demetriou G, Artymiuk P, Willett P: Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 2003.

26. Ashburner M, Lewis S: On ontologies for biologists: the Gene Ontology - uncoupling the web. Novartis Found Symp 2002.

27. Bairoch A: The ENZYME database in 2000. NAR 2000.
Figures
Figure 1: Comparison of UniProt indexed citations and discovered citations from Medline.
For a subset of all UniProt proteins, the extraction system identified the triple associations of organism-protein-residue in Medline abstract texts. The list of citations identified for these proteins was compared with the citation references from the corresponding UniProt entries.
[Figure 1 image not reproduced. Chart title: "Citation sets of Uniprot proteins". Legend: UC citation, common citation, RC citation. Proteins shown: KIT_MOUSE, NFKB1_MOUSE, BARD1_HUMAN, THIO_HUMAN, VPS36_HUMAN, ETXB_STAAU, Q7LZK5_BITAR, VPS27_YEAST, GP1BA_HUMAN, TGFA_HUMAN, XRCC1_HUMAN, LCK_HUMAN, FINC_HUMAN, HFE_HUMAN, TFR1_HUMAN, ASPP2_HUMAN. Count axis: 0 to 800 in steps of 100.]
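The comparison shown in Figure 1 amounts to partitioning two citation sets per protein. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that "RC" denotes the citations retrieved from Medline by the extraction system and that each legend entry corresponds to one disjoint part of the partition; the function name and the PMIDs in the usage example are hypothetical placeholders.

```python
from typing import Dict, Iterable, Set


def partition_citations(uniprot_pmids: Iterable[int],
                        retrieved_pmids: Iterable[int]) -> Dict[str, Set[int]]:
    """Split the citations of one protein into three disjoint sets.

    uniprot_pmids   -- PMIDs referenced in the UniProt entry (UC)
    retrieved_pmids -- PMIDs found by the extraction system (RC, assumed meaning)
    """
    uc = set(uniprot_pmids)
    rc = set(retrieved_pmids)
    return {
        "UC only": uc - rc,   # cited in UniProt, not found by the system
        "common": uc & rc,    # present in both sets
        "RC only": rc - uc,   # found by the system, not cited in UniProt
    }


if __name__ == "__main__":
    # Hypothetical PMIDs, for illustration only.
    parts = partition_citations([1, 2, 3], [2, 3, 4, 5])
    for label, pmids in parts.items():
        print(label, len(pmids))
```

The sizes of the three sets are what a chart of this kind would plot for each protein.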
Tables
Table 1: Six target categories of biological interest (SCAT).
The definitions of the categories of biological interest targeted in this study are listed together with the database references used to extract candidate terminologies. A mapping of these categories to equivalent or similar categories from UniProtKB (FCAT) is provided.
{| class="wikitable"
! SCAT !! FCAT !! reference !! definition
|-
| structure component || domain, motif, topo dom, chain, transmem, coil || PASTA [25] || Class denoting concepts that represent pieces and parts of the protein structure.
|-
| chemical modification || variant, mod res, peptide, var seq, lipid || n/a || Class denoting changes to the protein sequence and the chemical composition.
|-
| structural modification || region, site || n/a || Class denoting the changes to the protein structure without changes to the chemical composition.
|-
| binding type || binding, metal, disulfid, crosslnk, dna bind, np bind, zn fing, ca bind || GO [26] || Class denoting different physico-chemical forces leading to a bond formation between a protein structure component and a chemical entity.
|-
| enzymatic activity || act site || EC [27], GO [26] || Types of enzymatic reactions as a subpart to protein functions.
|-
| cellular phenotype || n/a || n/a || Class denoting different cellular phenotypes that can be affected by structural or compositional changes of a protein.
|}
Table 2: Named entity recognition and association detection performance evaluated on gold standard
corpus.
Performance was measured in terms of precision, recall, and F1 measure. o = organism; p = protein; r =
residue; o-p-r = association of o, p and r.
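{| class="wikitable"
! target !! available !! extracted !! TP !! precision !! recall !! F1
|-
| o || 123 || 109 || 88 || 0.81 || 0.72 || 0.76
|-
| p || 511 || 471 || 305 || 0.65 || 0.60 || 0.62
|-
| r || 202 || 222 || 197 || 0.87 || 0.96 || 0.91
|-
| o-p-r || 158 || 63 || 52 || 0.83 || 0.33 || 0.47
|}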
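The three measures in Table 2 follow from the counts in each row: precision is TP divided by the number of extracted items, recall is TP divided by the number of available items, and F1 is their harmonic mean. The sketch below uses these standard definitions; the function name `prf` is an illustrative choice and not taken from the paper. It recomputes the measures for the o, p and o-p-r rows.

```python
from typing import Tuple


def prf(available: int, extracted: int, true_positives: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    available      -- number of gold standard items
    extracted      -- number of items returned by the system
    true_positives -- number of extracted items that are correct
    """
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / available if available else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Counts (available, extracted, TP) copied from Table 2.
    rows = {"o": (123, 109, 88), "p": (511, 471, 305), "o-p-r": (158, 63, 52)}
    for target, counts in rows.items():
        p, r, f = prf(*counts)
        print(f"{target}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

The zero checks only guard against division by zero when a system extracts nothing or the gold standard is empty.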
Table 3: Cross-validation of organism-protein-residue extraction with UniProtKB.
Automatic performance analysis of the extraction with UniProtKB as reference. Performance was measured in terms of precision, recall, and F1 measure. UC = UniProt corpus; XC = cross validation corpus; UTP = triplet identifiers of UniProtID-TaxonomyID-PMID; URP = triplet identifiers of UniProtID-ResidueID-PMID; o = organism; p = protein; r = residue; o-p = association of o and p; p-r = association of p and r.
ResID
data o p r o-p p-r PMID TaxID UniProtID conv site seq range—pair
UC 136,559 11,348 175,695 28,950 33,750 4,021 2,281
UC - - - 6,532 0 0 0
UC + 119,880 11,348 174,717 25,482 30,041 3,740 2,095
UC + 129,792 11,328 175,695 28,932 33,723 3,991 2,278
UC + 30,732 4,743 115,882 28,950 33,750 4,021 2,281
UC + + 119,653 11,328 174,717 25,470 30,014 3,713 2,092
UC + + + 27,709 4,740 113,412 25,470 30,014 3,713 2,092
XC 5,253 1,536 45,869 9,519 7,342 227 421
XC - - - 131,306 0 0
XC + 5,253 1,536 45,869 9,519 7,342 227 421
XC + 5,253 1,536 45,869 9,519 7,342 227 421
XC + 5,253 1,536 45,869 9,519 7,342 227 421
XC + + 5,253 1,536 45,869 9,519 7,342 227 421
XC + + + 5,253 1,536 45,869 9,519 7,342 227 421
XC + + + 5,253 1,536 45,869 9,519 7,342 227 421
XC + + + + 5,253 1,536 45,869 9,519 7,342 227 421
XC + + + + + 4506 1301 3937 8804 5783 0 329
{| class="wikitable"
! triplet !! data !! filters applied (among o, p, r, o-p, p-r) !! available !! extracted !! common !! precision !! recall !! F1
|-
| UTP || XC || + + + || 70,401 || 7,333 || 5,625 || 0.77 || 0.08 || 0.14
|-
| URP || XC || + + + + + || 68,008 || 9,504 || 9,504 || 1.00 || 0.14 || 0.25
|}
Table 4: Feature classification performance.
The classification of contextual features of residues mentioned in text was used to identify annotations and
to classify them into categories of biological interest. Cross validation was performed with training and test
sets with 3600 and 400 features, respectively. Performance was measured in terms of precision, recall and
F1 measure.
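{| class="wikitable"
! SCAT category !! recall !! precision !! F1 !! FCAT category !! recall !! precision !! F1
|-
| structure component || 0.8 || 0.6 || 0.69 || motif || 0.45 || 1 || 0.62
|-
| || || || || domain || 0.5 || 0.62 || 0.55
|-
| chemical modification || 0.73 || 0.52 || 0.61 || variant || 0.77 || 0.5 || 0.61
|-
| || || || || lipid || 0.4 || 1 || 0.57
|-
| || || || || modified res || 0.47 || 0.59 || 0.52
|-
| || || || || peptide || 0.11 || 0.29 || 0.16
|-
| binding type || 0.68 || 0.67 || 0.67 || binding || 0.63 || 0.54 || 0.58
|-
| || || || || crosslink || 0.25 || 0.67 || 0.36
|-
| || || || || disulfid || 0.17 || 0.62 || 0.26
|-
| || || || || metal || 0.12 || 0.25 || 0.16
|-
| structural modification || 0.25 || 0.64 || 0.36 || site || 0.68 || 0.47 || 0.56
|-
| || || || || region || 0.59 || 0.46 || 0.52
|-
| enzymatic activity || 0.42 || 0.49 || 0.46 || active site || 0.48 || 0.5 || 0.49
|-
| cellular phenotype || 0.47 || 0.6 || 0.53 || n/a || || ||
|}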
Table 5: Feature detection evaluated on gold standard corpus.
The classification method was used to identify features of interest. The performance in detecting manually
determined annotations was measured in terms of precision, recall and F1 measure. SCAT = feature detection
using the six target categories; FCAT = feature detection using categories from the feature table in UniProtKB.
{| class="wikitable"
! feature !! available !! extracted !! common !! precision !! recall !! F1
|-
| SCAT || 164 || 474 || 100 || 0.21 || 0.61 || 0.31
|-
| FCAT || 164 || 460 || 97 || 0.21 || 0.59 || 0.31
|}
Table 6: Performance of residue-feature association detection evaluated on gold standard corpus.
The association of residue and annotation was detected by shallow parsing and extraction of verbal/prepositional relations. The performance was measured in terms of precision, recall and F1 measure. GC = gold standard corpus; r = residue; f = feature; o = organism; p = protein; s = verbal/prepositional relation between r and f; o-p = association between o and p; p-r = association between p and r.
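{| class="wikitable"
! data !! extraction and filter marks (among s, r, f, o, p, o-p, p-r) !! available !! extracted !! common !! precision !! recall !! F1
|-
| GC || + || 88 || 132 || 68 || 0.52 || 0.77 || 0.62
|-
| GC || + + + || 88 || 65 || 30 || 0.46 || 0.34 || 0.39
|-
| GC || + + + + + || 82 || 62 || 27 || 0.44 || 0.33 || 0.38
|-
| GC || + + + + + + + || 82 || 93 || 19 || 0.20 || 0.23 || 0.22
|}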
Table 7: Comparison of extracted protein residue annotations with UniProtKB.
The extraction system delivered protein annotations from Medline abstracts. The examples were drawn from the extraction results on the gold standard corpus.
{| class="wikitable"
! UniProtID !! ResidueID !! PMID !! SCAT !! extracted feature !! FCAT !! UniProt annotation
|-
| P40380 || THR13 || 12135491 || str comp || major phosphorylation sites for MAPK || mod res || Phosphothreonine; by MAPK
|-
| " || SER19 || " || " || " || " || Phosphoserine; by MAPK
|-
| " || SER19 || 12135491 || chem mod || negative effect || mutagen || S → E: reduces activity as a cdc2 inhibitor; when associated with E-13
|-
| Q93K00 || ASP123 || 12147465 || enzymatic || the putative catalytic triad || act site || nucleophile (by similarity)
|-
| " || HIS279 || " || " || " || " || proton acceptor (by similarity)
|-
| " || ASP250 || " || " || " || " || proton donor (by similarity)
|-
| Q93K00 || GLU55 || 12147465 || str comp || putative oxyanion hole || n/a || n/a
|-
| " || TRP124 || " || " || " || n/a || n/a
|-
| Q02809 || W612 || 9617436 || str comp || " || n/a || n/a
|-
| Q9HAB8 || GLY43 || 12906824 || str comp || conserved ATP binding residues || n/a || n/a
|-
| " || SER61 || " || " || " || n/a || n/a
|-
| " || GLY63 || " || " || " || n/a || n/a
|-
| " || GLY66 || " || " || " || n/a || n/a
|-
| " || PHE230 || " || " || " || n/a || n/a
|-
| " || ASN258 || " || " || " || n/a || n/a
|-
| " || ASN59 || " || " || conserved phosphopantothenate binding || n/a || n/a
|-
| " || ALA179 || " || " || " || n/a || n/a
|-
| " || ALA180 || " || " || " || n/a || n/a
|-
| " || ASP183 || " || " || " || n/a || n/a
|}