Scalable Text Mining Assisted Curation of Post-Translationally
      Modified Proteoforms in the Protein Ontology
         Karen E. Ross and Darren A. Natale                        Cecilia Arighi, Sheng-Chih Chen, Hongzhan Huang,
              Protein Information Resource                         Gang Li, Jia Ren, Michael Wang, K. Vijay-Shanker
          Georgetown University Medical Center                                       and Cathy H. Wu
                 Washington, DC, USA                                Center for Bioinformatics and Computational Biology
             E-mail: ker25@georgetown.edu                                          University of Delaware
                                                                                     Newark, DE, USA

                                                The Protein Ontology Consortium


Abstract—The Protein Ontology (PRO) defines protein classes         [2]. It has long been appreciated that PTMs play a pivotal
and their interrelationships from the family to the protein form    role in protein function, regulating activity, localization, and
(proteoform) level within and across species. One of the unique     protein-protein interactions (PPIs), and that disruptions in
contributions of PRO is its representation of post-                 PTM can lead to disease [3]. Recent advances in proteomics
translationally modified (PTM) proteoforms. However,                have revealed that the majority of human proteins undergo
progress in adding PTM proteoform classes to PRO has been           PTM, often on many sites [3]. The ability of PRO to
relatively slow due to the extensive manual curation effort         represent the full variety of PTM proteoforms for each gene
required. Here we report an automated pipeline for creation of      product, including proteoforms with combinations of
PTM proteoform classes that leverages two phosphorylation-          multiple modifications, makes it an ideal resource for
focused text mining tools (RLIMS-P, which detects mentions of       understanding PTM cross-talk and PTM-regulated functions.
kinases, substrates, and phosphorylation sites, and eFIP, which
                                                                    Thus, a major focus of the PRO curation effort is to represent
detects phosphorylation-dependent protein-protein
                                                                    and annotate PTM proteoforms and identify corresponding
interactions (PPIs)) and our integrated PTM database,
iPTMnet. By applying this pipeline, we obtained a set of ~820       proteoforms across species (ortho-proteoforms).
substrate-site pairs that are suitable for automated PRO term           There are currently three curation pipelines for creation
generation with literature-based evidence attribution.              of proteoform classes in PRO: (1) bulk import of data from
Inclusion of these terms in PRO will increase PRO coverage of       other projects that characterize PTM proteoforms, including
species-specific PTM proteoforms by 50%. Many of these new          Reactome [4] and the Consortium for Top-Down Proteomics
proteoforms also have associated kinase and/or PPI
                                                                    [5]; (2) requests for individual terms needed for Gene
information. Finally, we show a phosphorylation network for
                                                                    Ontology annotation in model organism databases (e.g.,
the human and mouse peptidyl-prolyl cis-trans isomerase
(PIN1/Pin1) derived from our dataset that demonstrates the
                                                                    Mouse Genome Database [6]) or for semantic tagging (e.g.,
biological complexity of the information we have extracted.         Alzforum [7]); and (3) in-house literature-based curation
Our approach addresses scalability in PRO curation and will         using a text mining assisted workflow [8]. The need for
be further expanded to advance PRO representation of                extensive manual review by domain experts has proved to be
phosphorylated proteoforms.                                         a major bottleneck in PRO curation. Moreover, coverage of
                                                                    PTM proteoforms in PRO reflects the organisms and
    Keywords—Protein Ontology (PRO), text mining, post-             pathways of interest to individual users. PRO presently
translational modification, proteoform, phosphorylation             contains ~2,550 PTM proteoform classes, including 1,700
                                                                    organism-specific terms and 850 organism-independent
                     I. INTRODUCTION                                parent classes. Of the organism-specific terms, about half
    The Protein Ontology (PRO) (proconsortium.org) [1] is           were created via bulk data import while the remainder were
an OBO Foundry ontology that defines classes of proteins            created on an individual basis.
and protein complexes and indicates how these classes                   We have previously used two PTM-focused text mining
interrelate. Classes defined in PRO can be either organism-         tools to assist with manual curation of PTM proteoforms.
independent or organism-specific and range in granularity           The first tool, RLIMS-P [9], detects mentions of kinase,
from more general protein family classes to more specific           substrate, and phosphorylation site in free text; the second,
proteoform classes (which account for the precise molecular         eFIP [10], detects causal relationships between
form of a protein, including specification of sequence or           phosphorylation and PPIs (e.g., the binding between Bad
splice variant and any post-translational modification [PTM])
pSer-136 and 14-3-3 in the sentence: Akt phosphorylates Bad               To normalize the gene/protein names in the text mining
at Ser136 and promotes the association of Bad with 14-3-3.            results to UniProtKB accession numbers (ACs), we use
PMID: 17342096). Although these tools have considerably               PubTator [15] and the UniProt ID mapping service [16].
speeded up expert curation by pinpointing relevant                    PubTator is a web interface that provides RESTful APIs to
information in the literature, they have an untapped potential        retrieve gene normalization results generated by GenNorm
in further automation of the curation process.                        [17]. For each PMID, a list of gene mentions and their
                                                                      normalized Entrez IDs is retrieved. The Entrez IDs are then
    Concurrent with our text mining work, we have
                                                                      mapped to UniProtKB ACs using mapping information
developed iPTMnet, an integrated resource for PTM network             retrieved from the UniProt website. Any Entrez IDs that
analysis (http://research.bioinformatics.udel.edu/iptmnet/;
                                                                      cannot be mapped to a UniProtKB AC are discarded. To
[11]). iPTMnet integrates text mining results from RLIMS-P            improve data quality, we perform two integrity checks on the
and eFIP that have been automatically normalized (i.e., the
                                                                      normalized results: (1) for substrates, we confirm that the
proteins detected in text have been mapped to their                   mapped protein sequence has the correct residue at the
corresponding UniProtKB identifiers) with data from
                                                                      position that is reported to be phosphorylated (e.g., if the
multiple high-quality PTM resources (e.g., PhosphoSitePlus            phosphorylation site is Ser-100, we confirm that position 100
[3] and PhosphoGrid [12]), covering organisms from human
                                                                      of the mapped sequence is a serine); and (2) for kinases, we
to yeast.                                                             check whether the corresponding UniProtKB record contains
     Here we describe an automated workflow for creation of           the keyword "kinase."
PTM proteoforms in PRO that takes advantage of the
                                                                      B. Integration of Text Mining Results with PTM Database
information we have integrated in the iPTMnet database.
                                                                          Information: iPTMnet
Key components of the workflow include i) full-scale
PubMed text mining using RLIMS-P/eFIP; ii) automatic                      iPTMnet (Fig. 1) integrates normalized results of full-
normalization of protein entities in the text mining output;          scale text mining from RLIMS-P and eFIP with PTM data
iii) validation of the text mining results by comparing to            from several expert curated PTM resources for visualization
information in expert curated PTM resources; and iv)                  and analysis of PTM networks. Underlying iPTMnet is an
automatic generation of PRO terms, including logical and              Oracle (11g release 2) database. The text mining results that
textual definitions, based on a standardized template. In our         are consumed by iPTMnet are the normalized RLIMS-P
first application of this approach, we identified ~820                results from all PubMed abstracts and the normalized eFIP
proteoforms with a single phosphorylation site that can be            results from all PubMed abstracts and PMC full-length
included in PRO. For many of these terms, we also                     articles. For data integration, gene/protein names from the
automatically extracted kinase and/or interactant                     source databases, which are represented in a variety of
information, which can be used to annotate the terms. This            formats, are mapped to UniProtKB ACs. We used the
work reflects a significant advance in our efforts to represent       iPTMnet database as the source of PTM information for
the landscape of PTM proteoforms in PRO.                              PRO proteoform term curation (see below).
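
    The two integrity checks applied to the normalized text mining results
(correct residue at the reported position of the mapped substrate sequence,
and the keyword "kinase" in the UniProtKB record of a reported kinase) can be
illustrated with a minimal sketch. The function names, the restriction to
Ser/Thr/Tyr, and the representation of a record as a list of keyword strings
are assumptions made for illustration only; they do not describe the actual
pipeline code.

```python
from typing import Iterable

# Illustrative assumption: only the three commonly phosphorylated residues
# are handled, mapped from three-letter to one-letter codes.
THREE_TO_ONE = {"Ser": "S", "Thr": "T", "Tyr": "Y"}


def site_matches_sequence(sequence: str, residue: str, position: int) -> bool:
    """Check (1): the mapped sequence has `residue` at 1-based `position`."""
    expected = THREE_TO_ONE.get(residue)
    if expected is None or not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1] == expected


def looks_like_kinase(record_keywords: Iterable[str]) -> bool:
    """Check (2): some keyword of the record contains 'kinase'."""
    return any("kinase" in keyword.lower() for keyword in record_keywords)
```

    For a reported site Ser-100, `site_matches_sequence(seq, "Ser", 100)`
returns True only if position 100 of the mapped sequence is a serine; a
mismatch suggests that the mention was normalized to the wrong protein,
isoform, or species.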

                          II. APPROACH                                C. Selection of PTM Proteoforms for Automated PRO
                                                                         Curation
A. Full Scale Text Mining and Entity Normalization                       To select PTM proteoforms for PRO curation (Fig. 1) we:
     We have developed the text mining tools RLIMS-P [9]
and eFIP [10] to mine kinase-substrate-site relationships and            •   Retrieved from the iPTMnet database all substrate-
phosphorylation-dependent PPIs, respectively, from free text.                site pairs that were captured by RLIMS-P and at least
The rule-based RLIMS-P has achieved F-scores (harmonic                       one PTM database based on the same PMID(s) (Fig.
mean between precision and recall [13]) of 0.91, 0.92, and                   1, Step 1a). We excluded PMIDs where multiple
0.95 for kinases, substrates, and sites, respectively, based on              phosphorylation sites were detected by RLIMS-P or
a corpus of PubMed abstracts [9]. It has been evaluated in                   by the corroborating database(s) because of the
the BioCreative Interactive Text Mining Task for usability                   difficulty of automatically determining whether a
and utility [14] and is being adopted for computer-assisted                  combinatoric PTM proteoform (simultaneous
literature-based curation by several databases. eFIP employs                 phosphorylation on multiple sites) or independent
RLIMS-P to detect mentions of phosphorylation and then                       singly phosphorylated proteoforms were being
examines one or two consecutive sentences for any mention                    described. We also discarded cases with conflicts
of proteins that interact with the substrate. The textual                    between the text mined and database information
position of this information relative to phosphorylation is                  (e.g., due to errors in automated species assignment).
then used to assess whether the phosphorylation event has a              •   Obtained normalized kinase and phosphorylation-
direct effect (positive or negative) on the interaction. In an               dependent interactant information from the iPTMnet
evaluation on 100 sections of full-length articles from the                  database for the selected substrate-site pairs (Fig. 1,
PMC Open Access collection, eFIP achieved an F-score of                      Step 1b). After manual validation, this information
84% [10]. Results of full-scale RLIMS-P/eFIP mining of                       can potentially be used to associate annotation with
PubMed abstracts and PubMed Central Open Access (PMC)                        the PRO terms.
articles are stored in a local database. The stored information
includes entities, relations, and evidence attribution.                  •   Excluded PMIDs where the abstract contains
                                                                             language that suggests that PTMs other than
    Funding: NSF (ABI-1062520), NIH (R01GM080646), Delaware
INBRE (P20GM103446), and institutional resources of Center for
Bioinformatics and Computational Biology at University of Delaware.
                                                                                normalization of protein entities, we obtained ~5,300
                                                                                normalized substrate-site pairs and ~1,550 kinase-substrate-
                                                                                site triples. Mining of PubMed abstracts and PMC full-length
                                                                                articles with eFIP identified ~8,500 articles with PTM-
                                                                                dependent PPI information; after normalization, we obtained
                                                                                ~770 substrate-site-interactant triples.
                                                                                    Of the ~5,300 substrate-site pairs from RLIMS-P, 1,033
                                                                                were curated by another resource in the iPTMnet database
                                                                                based on the same PMID(s). Of these, we eliminated 94
                                                                                because there was a conflict between the text mining results
                                                                                and the curated resource, usually related to species
                                                                                assignment; 84 because the abstracts they were extracted
                                                                                from mentioned other PTMs; and 78 because the site and/or
                                                                                PMID(s) were already in PRO. (Note: some substrate-site
                                                                                pairs were eliminated for more than one of these reasons.)
                                                                                After these filtering steps, we obtained 818 substrate-site
Fig. 1. Workflow for automated generation of PRO terms for PTM                  pairs (see footnote 2) potentially suitable for automated PRO term
proteoforms. Substrate-site pairs identified by RLIMS-P and supported by        generation. Of these, 731 (89%) have kinase information,
at least one other resource are retrieved from the iPTMnet database (1a)        including 285 (35%) with kinase information from RLIMS-
along with any pertinent kinase or PPI information (1b). Additional
filtering to remove cases that are already in PRO or are likely to be part of
                                                                                P, and 93 (11%) have PPI information (from eFIP), which
multiply modified proteoforms is performed (2). PRO stanzas are created         can be added to PRO as annotation after expert review.
based on a template (3) and annotation (e.g., PPIs) is added to the PRO
Annotation File (PAF; 4).
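
    The filtering described above can be sketched as follows. Of the 1,033
corroborated substrate-site pairs, 1,033 - 818 = 215 were removed, whereas
the three stated reasons sum to 94 + 84 + 78 = 256, so a pair can fail more
than one check; the sketch therefore evaluates every criterion per candidate.
The `Candidate` fields, the function names, and the two-entry keyword pattern
(taken from the examples ubiquitin* and acetyl*) are illustrative assumptions,
not the actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Illustrative assumption: prefixes hinting at PTMs other than
# phosphorylation, based on the two examples named in the text.
OTHER_PTM_PATTERN = re.compile(r"\b(?:ubiquitin|acetyl)\w*", re.IGNORECASE)


@dataclass(frozen=True)
class Candidate:
    substrate_ac: str              # UniProtKB accession of the substrate
    site: str                      # single phosphorylation site, e.g. "Ser-16"
    abstract: str                  # abstract text of the supporting PMID
    conflicts_with_database: bool  # text mining disagrees with curated source
    already_in_pro: bool           # site and/or PMID already curated in PRO


def mentions_other_ptm(abstract: str) -> bool:
    """True if the abstract contains a word starting with a listed prefix."""
    return OTHER_PTM_PATTERN.search(abstract) is not None


def passes_filters(candidate: Candidate) -> bool:
    """A candidate is kept only if it fails none of the three criteria."""
    return not (
        candidate.conflicts_with_database
        or mentions_other_ptm(candidate.abstract)
        or candidate.already_in_pro
    )


def select(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Return the candidates suitable for automated term generation."""
    return [c for c in candidates if passes_filters(c)]
```

    Because the pattern is anchored at a word boundary, it is deliberately
simple; a production filter would presumably need a broader PTM vocabulary.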
                                                                                    Two curators manually reviewed the full-text articles for
                                                                                91 substrate-site pairs randomly chosen from the list of 818
         phosphorylation are described (e.g., ubiquitin* and                    results. The number of results reviewed was determined by
         acetyl*). This check reduced the likelihood that the                   the time available to the curators. In 83 cases (91%), the
         proteoform has other PTMs in addition to the single                    evidence supported the existence of the singly-
         phosphorylation site (Fig. 1, Step 2).                                 phosphorylated PTM proteoform identified by our automated
     •   Excluded cases where the substrate-site pair is                        approach. Of the remaining eight pairs, there was one case
         already in PRO, either as a singly phosphorylated                      where the species was assigned incorrectly by all sources
         proteoform or as part of a multiply modified form                      (text-mining and two databases) and seven cases where the
         (Fig. 1, Step 2). In addition, we excluded results that                article suggested that the proteoform had multiple
         were extracted from PMIDs that were already curated                    phosphorylation sites, even though only a single site was
         by PRO as we reasoned that all proteoforms that are                    captured by all sources. In one of the seven cases, the
         supported by those PMIDs are likely to have been                       phosphorylation required prior phosphorylation on another
         identified in the expert curation process.                             site; thus, the singly phosphorylated form we proposed is
                                                                                unlikely to exist. Using the RLIMS-P web interface [9], we
D. Automated Generation of PRO Stanzas                                          performed a keyword search for “priming”, a term
    PRO terms can be created for PTM proteoforms that pass                      commonly used to describe sequential phosphorylation
all data integrity checks using a template (Fig. 1 Step 3). If                  events, and found ~600 results (only 0.3% of total RLIMS-P
the substrate is mapped to a specific isoform of a protein, the                 results); also, our pipeline will filter out any of these cases
name and text definition will additionally include the isoform                  where multiple sites are mentioned in the abstract. Therefore,
number and the parent will be the organism-specific isoform.                    we think that this type of error will be relatively rare. In the
Associated kinase and/or PTM-dependent interactant                              other six cases, the existence of the singly phosphorylated
information (i.e., eFIP results) will be prioritized for expert                 form was not ruled out; moreover, it is acceptable to create a
review. Kinase information will be added to the stanza                          PRO term that names only a subset of the modification sites
comment line and interactant information will be added to                       in a multiply modified proteoform because, conformant to
the PRO Annotation File (PAF) following standard PRO                            the Open World Assumption [18], PRO does not make any
curation procedures (Fig. 1, Step 4; see footnote 1).                           assertions about sites that are not explicitly named. PRO only
                                                                                asserts what is known based on the experimental results.
                   III. RESULTS AND DISCUSSION                                  Because the existence of other site modifications cannot be
                                                                                excluded, PRO definitions imply only that at least the
A. Identification of PTM Proteoforms for Automated PRO                          explicit modifications have to be present. Thus, our
   Curation.                                                                    evaluation indicates that our dataset is highly enriched for
   From full-scale text mining of 25 million PubMed                             well-supported singly phosphorylated forms while
abstracts with RLIMS-P, we identified ~185,000 papers with                      containing very few errors (2/91 (2%)).
kinase, substrate, and/or site information. After

     1                                                                             2
      PRO curation guidelines can be found on the PRO website                       List available at:
     (http://proconsortium.org).                                                   http://www.proteininformationresource.org/pro/iptmnet2pro.html
                                                                   organism-specific PRO terms for PTM proteoforms, a 50%
                                                                   increase over the number of species-specific PTM forms
                                                                   currently curated by PRO. As the use case demonstrates, this
                                                                   approach can provide rich information on PTM sites, PTM
                                                                   enzymes, biological consequences of PTM (i.e., PTM-
                                                                   dependent PPI), and orthologous proteoforms across species.
                                                                   At the same time, the automatic detection and normalization
                                                                   of kinase and PPI information will greatly reduce the manual
                                                                   effort required for annotation of the automatically created
                                                                   PRO terms.
                                                                       In this study, we focused exclusively on data supported
                                                                   by text mining results; however, our approach could be
                                                                   applied to substrate-site pairs that are reported in any two
                                                                   resources in the iPTMnet database. We also plan to identify
                                                                   proteoform candidates from full-text RLIMS-P results. It has
                                                                   been observed that ~90% of phosphorylation sites are
                                                                   mentioned only in the body of an article (not the abstract) [9,
                                                                   20] so full-text mining should greatly increase our yield of
                                                                   proteoforms as well as improve data integrity. Finally, we are
                                                                   considering approaches for automated detection of
                                                                   proteoforms with multiple PTMs. It is often very challenging
                                                                   for a curator, let alone an automated system, to determine
                                                                   whether experimental evidence supports the existence of a
                                                                   proteoform with multiple PTMs as opposed to a population
 Fig. 2. Network of PIN1 kinases, PTM proteoforms, and
 phosphorylated interacting proteins from human and mouse.         of proteins with individual modifications. One possibility
                                                                   would be to make use of PTM proteomic data. Bottom-up
                                                                   proteomic data is usually not useful for detecting PTM
B. Use Case: PIN1 Phosphorylation Network                          combinations because the proteins are cleaved into short
                                                                   peptides before identification. If a protein has several
    Fig. 2 shows a network centered on the peptidyl-prolyl         phosphorylated residues, they will typically be separated
cis-trans isomerase PIN1/Pin1 (human/mouse) that illustrates       across multiple peptides, making it impossible to determine
the potential richness of the PTM information in our dataset       whether they were originally present on the same protein
and the advantages of using an ontological representation of       molecule. However, if two phosphorylation sites on a protein
PTM proteoforms. PIN1/Pin1 recognizes a phosphorylated             are close enough, they could potentially be found on the
motif in its binding partners and induces a conformational         same peptide. In these cases, proteomic data could be used as
change [19]. Currently, the information in PRO about               evidence in support of the multiply modified proteoform.
PIN1/Pin1 is limited—no PTM proteoforms of PIN1/Pin1
are described and only one case of PIN1 binding to a                   In conclusion, we have implemented an automated
phosphoprotein (CCNE1 pSer-384, PR:000025637) is                   workflow using text mining results and curated database
annotated. In our dataset, we found two human PTM                  information to create new PRO terms for PTM proteoforms.
proteoforms (NFC1 pSer-345 and BAX pThr-167) that bind             This approach, which can achieve large gains in curation
to PIN1 in a phospho-dependent manner. Several kinases for         efficiency without compromising quality, can significantly
these proteoforms were identified, including MAPK1, which          expand the ontological representation of PTM.
phosphorylates both. In turn, we found three PTM
proteoforms of PIN1 (pSer-16, pSer-71, and pSer-138),
phosphorylated by multiple kinases. Interestingly, we also                                       REFERENCES
found a Ser-16 phosphorylated proteoform of mouse Pin1.
The human and mouse pSer-16 proteoforms can be                     [1]   D.A. Natale, et al., "Protein Ontology: a controlled structured network
                                                                         of protein entities," Nucleic Acids Res, vol. 42, pp. D415-421, 2014.
connected at the ortho-proteoform level in the PRO hierarchy
                                                                   [2]   L.M. Smith, N.L. Kelleher, and Consortium for Top Down Proteomics,
(Fig 2, grey node).                                                      "Proteoform: a single term describing protein complexity," Nat
                                                                         Methods, vol. 10, pp. 186-187, 2013.
C. Conclusions and Future Work
                                                                   [3]   P.V. Hornbeck, et al., "PhosphoSitePlus, 2014: mutations, PTMs and
    Here we describe a workflow for automatic generation of              recalibrations," Nucleic Acids Res, vol. 43, pp. D512-520, 2015.
PRO terms for PTM proteoforms based on text mining                 [4]   A. Fabregat, et al., "The Reactome pathway Knowledgebase," Nucleic
results with direct literature evidence attribution. When                Acids Res, vol. 44, pp. D481-487, 2016.
developing an automated curation pipeline, it is important to      [5]   X. Dang, et al., "The first pilot project of the consortium for top-down
minimize inclusion of erroneous information; thus, we used               proteomics: a status report," Proteomics, vol. 14, pp. 1130-1140,
stringent filtering criteria at the cost of discarding a great           2014.
majority (~85%) of our normalized substrate-site pairs. Even       [6]   C.J. Bult, et al., "The Mouse Genome Database: enhancements and
                                                                         updates," Nucleic Acids Res, vol. 38, pp. D586-592, 2010.
with strict filters in place, we will be able to create ~820 new
[7]  J. Kinoshita and T. Clark, "Alzforum," Methods Mol Biol, vol. 401,
     pp. 365-381, 2007.
[8] K.E. Ross, et al., "Construction of protein phosphorylation networks
     by data mining, text mining and ontology integration: analysis of the
     spindle checkpoint," Database (Oxford), vol. 2013, pp. bat038, 2013.
[9] M. Torii, et al., "RLIMS-P 2.0: A Generalizable Rule-Based
     Information Extraction System for Literature Mining of Protein
     Phosphorylation Information," IEEE/ACM Trans Comput Biol
     Bioinform, vol. 12, pp. 17-29, 2015.
[10] C.O. Tudor, et al., "Construction of phosphorylation interaction
     networks by text mining of full-length articles using the eFIP system,"
     Database (Oxford), vol. 2015, 2015.
[11] K.E. Ross, et al., "iPTMnet: Integrative Bioinformatics for Studying
     PTM Networks," Methods Mol Biol, vol. in press.
[12] I. Sadowski, et al., "The PhosphoGRID Saccharomyces cerevisiae
     protein phosphorylation site database: version 2.0 update," Database
     (Oxford), vol. 2013, pp. bat026, 2013.
[13] R. Rodriguez-Esteban, "Biomedical text mining and its applications,"
     PLoS Comput Biol, vol. 5, pp. e1000597, 2009.
[14] C.N. Arighi, et al., "An overview of the BioCreative 2012 Workshop
     Track III: interactive text mining task," Database (Oxford), vol. 2013,
     pp. bas056, 2013.
[15] C.H. Wei, H.Y. Kao, and Z. Lu, "PubTator: a web-based text mining
     tool for assisting biocuration," Nucleic Acids Res, vol. 41, pp. W518-
     522, 2013.
[16] UniProt Consortium, "Update on activities at the Universal Protein Resource
     (UniProt) in 2013," Nucleic Acids Res, vol. 41, pp. D43-47, 2013.
[17] C.H. Wei and H.Y. Kao, "Cross-species gene normalization by
     species inference," BMC Bioinformatics, vol. 12 Suppl 8, pp. S5,
     2011.
[18] R. Stevens, et al., "Using OWL to model biological knowledge,"
     International Journal of Human-Computer Studies, vol. 65, pp. 583-
     594, 2007.
[19] T.H. Lee, et al., "Death-associated protein kinase 1 phosphorylates
     Pin1 and inhibits its prolyl isomerase activity and cellular function,"
     Mol Cell, vol. 42, pp. 147-159, 2011.
[20] A.L. Veuthey, et al., "Application of text-mining for updating protein
     post-translational modification annotation in UniProtKB," BMC
     Bioinformatics, vol. 14, pp. 104, 2013.