=Paper= {{Paper |id=Vol-1327/12 |storemode=property |title=Knowledge Representation of Protein PTMs and Complexes in the Protein Ontology: Application to Multi-Faceted Disease Analysis |pdfUrl=https://ceur-ws.org/Vol-1327/icbo2014_paper_32.pdf |volume=Vol-1327 |dblpUrl=https://dblp.org/rec/conf/icbo/RossTLDCCANW14 }} ==Knowledge Representation of Protein PTMs and Complexes in the Protein Ontology: Application to Multi-Faceted Disease Analysis== https://ceur-ws.org/Vol-1327/icbo2014_paper_32.pdf
                                                      ICBO 2014 Proceedings


     Knowledge Representation of Protein PTMs and
    Complexes in the Protein Ontology: Application to
           Multi-Faceted Disease Analysis

 Karen E. Ross1, Catalina O. Tudor1, Gang Li1, Ruoyao Ding1, Irem Celen1, Julie Cowart1, Cecilia N. Arighi1,
                                  Darren A. Natale2 and Cathy H. Wu1,2
         1 Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
           2 Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
                                       E-mail: ross@dbi.udel.edu



    Abstract—Alterations in protein post-translational modification (PTM) and PTM cross-talk are
increasingly being appreciated as driving mechanisms of human disease. The Protein Ontology (PRO) is a
valuable resource to study the relation between PTM and disease because it represents individual
proteins and protein complex subunits at the proteoform (e.g., isoform, PTM form, and sequence variant)
level, with links to their functional properties. We constructed a multi-relation network that
represents knowledge obtained from large-scale text mining for phosphorylation-dependent
protein-protein interactions (PPIs) and their disease associations, built on the PRO framework for
representation of PTM forms, complexes, and protein families, as well as their attributes and
relationships. We then conducted two case studies that demonstrate the use of PRO in disease analysis.
(i) We performed cross-species comparisons of two glioma-associated phosphorylated proteoforms of the
human DNMT1 methylase, which revealed that the forms are not strictly conserved in mouse, a frequently
used glioma model system. (ii) We used PRO-defined proteoforms of the oncoprotein beta-catenin
phosphorylated on various combinations of the N-terminal sites, Ser-33, Ser-37, Thr-41, and Ser-45, to
interpret a hierarchical clustering analysis of cancer types based on their pattern of mutations in
these sites. The cancers formed two major clusters: one with mutations in Ser-33/Ser-37/Thr-41 and the
other with mutations in Thr-41/Ser-45. Proteoform-specific annotation in PRO suggests that
stabilization of beta-catenin may play a role in oncogenesis in the first group, whereas alterations in
beta-catenin transcriptional or cell adhesion activity may play a more important role in the second
group. Together, these scenarios illustrate the general applicability of PRO to disease understanding.
Future plans include the integration of PRO with other semantic resources to increase our ability to
address these problems with computational reasoning.

    Keywords—Protein Ontology; phosphorylation; text mining; beta-catenin; cancer

                         I. INTRODUCTION
    Aberrations in protein post-translational modification (PTM), resulting from genetic variations
that affect individual PTM sites as well as alterations in PTM enzyme activity that have global effects
on the balance of PTM forms in the cell, have been implicated in many diseases [1, 2]. Protein
phosphorylation, in particular, has been recognized as a central disease-driving mechanism, leading to
the development of kinase inhibitors as therapeutic agents [1]. With state-of-the-art text mining
tools, it is now possible to extract detailed information about proteins, PTMs, and diseases from the
literature on a large scale. Tools such as RLIMS-P [3] and eFIP [4] have captured a wealth of
information on phosphorylated protein forms, their modifying enzymes, and the functional impact of
phosphorylation, with a minimum of manual curator effort.

    Despite these advances, representing this information in a form that is useful for both human
interpretation and computational reasoning is challenging. It is important not only to link genetic
variant information with its effects on protein sequence and function but also to capture the impact of
imbalances of particular PTM forms and complexes on disease.

    Used in conjunction with other bioinformatic resources, the Protein Ontology
(http://pir.georgetown.edu/pro/pro.shtml; PRO [5]) enables the structured representation and
interpretation of this information. PRO, a member of the Open Biomedical Ontologies (OBO) foundry,
represents proteoforms (e.g., isoforms, PTM forms, and sequence variants) [6] and protein complexes and
their relationships within and across species. Once a proteoform or complex is defined in PRO, it can
be annotated with functional and/or disease information derived from the scientific literature or
bioinformatic databases. This framework can then support the analysis of biological processes in health
and disease.

    Mouse models have been critical for understanding human diseases. These models rely on the high
degree of conservation of proteins and pathways between human and mouse. Although protein
phosphorylation sites in human are also highly conserved in mouse (92% conserved in one study [7]),
conservation at the proteoform level, which can have significant functional consequences in a disease
model, has not been assessed. With its emphasis on representation of
proteoforms and cross-species relationships, PRO is a valuable resource for this type of analysis.

    In this article, we leverage PRO to: (i) create a phosphorylation network based on large-scale text
mining of phosphorylation-dependent PPIs that impact disease; (ii) facilitate cross-species comparison
of proteoforms of the DNMT1 methylase that are associated with glioblastoma in humans; and (iii)
interpret patterns of mutations in the beta-catenin oncoprotein observed in different cancer types.
These examples illustrate the variety of ways in which PRO can be used to represent disease knowledge
and gain insight into disease mechanisms.

                            II. METHODS
    To identify phosphorylation-dependent PPIs described in the literature, all PubMed abstracts and
PubMed Central (PMC) Open Access full-length articles were processed using eFIP, which identifies
mentions of phosphorylation-dependent PPIs. The kinases, phosphorylated proteins (substrates), and
interacting partners (interactants) were mapped to UniProtKB identifiers, when possible, using GeneNorm
[8]. Disease mentions in article titles or abstracts were computationally detected by matching to a
custom dictionary of disease terms based on MeSH and MedlinePlus. PRO terms for proteoforms and
complexes were created as described in [9]. Term annotation such as binding partners and disease
association is stored in the PRO annotation file (PAF), and PTM enzyme information is recorded in the
comments section of the OBO stanza using structured vocabulary. All terms and annotations are available
upon request and will be made public on the PRO website and in downloadable files as part of PRO
release 43 (September 2014). The network was constructed using Cytoscape 3.0 [10]. Sequence alignment
was performed using Clustal Omega [11] and visualized with Jalview 2.8.1 [12]. Cancer-associated
mutations in beta-catenin were obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC)
[13]. Tumor tissue types that had at least 10 mutations in the beta-catenin N-terminal phosphorylation
sites Ser-33, Ser-37, Thr-41, and Ser-45 were used. For each tissue, the proportion of mutations at each
site was calculated relative to the total number of mutations at all four sites. The heatmap was
constructed using the heatmap.2 function of R (version 3.0.2; http://www.r-project.org/) with default
parameters.

    This work has been supported by the National Science Foundation [ABI-1062520] and the National
Institutes of Health [5R01GM080646-08].

                  III. RESULTS AND DISCUSSION
A. Network Representation of Disease-Associated Phosphorylation-Dependent Protein-Protein Interactions
    Text mining of over 23 million PubMed abstracts and 800,000 PMC open access full-length articles
using eFIP identified sentences describing PPIs that were dependent on the phosphorylation state of one
of the interactants in over 13,000 articles. About 500 articles also had phosphorylation site
information, UniProtKB-mapped substrates and interactants, and mention of disease in the title or
abstract. Through manual curation of the 109 most recent articles, we found 52 disease-associated
phospho-dependent PPIs in 39 articles. (In the remaining articles the disease mention was not causally
related to the PPI.) A multi-relation network based on some of these results, including
phospho-dependent PPIs, PTM enzyme-PTM form relationships, proteoform/complex-disease relationships,
and relations among PRO terms, is shown in Fig. 1.

    This network illustrates several ways in which the PRO framework allows curators to precisely
represent complex biological information. First, because PRO treats each proteoform as a separate
entity, multiple forms of a protein can be defined and individually annotated. For example, the
Tyr-357-phosphorylated form (PR:000037508) and the Ser-127 phosphorylated form (PR:000037510) of YAP1
differ in their ability to bind 14-3-3 proteins (PR:000003237) and the apoptosis regulatory protein p73
(PR:O15350) and exhibit different associations with cancer and Alzheimer’s Disease. PRO can also
represent forms with multiple types of modification, facilitating the description of PTM cross-talk
(e.g., CCND1 Thr-286-phosphorylated and ubiquitinated form (PR:000037512)). Second, PRO represents
protein complexes as distinct entities to which complex-specific annotation can be attached. Moreover,
complex subunits are defined using PRO terms, enabling the specification of which particular
proteoforms (e.g., phosphorylated forms) are part of the complex. For example, the proteoform of the
DNMT1 DNA methylase that lacks phosphorylation on Ser-127 and Ser-143 (PR:000037504) forms a complex
with the DNA-associated factors PCNA (PR:P12004) and UHRF1 (PR:Q96T88). This complex (PR:000037517) has
been associated with tumor suppression [14]. Third, protein terms in PRO are defined at multiple levels
of granularity from the family level down to the isoform and/or modification level. Thus, when
describing a biological relationship involving a protein, the term that is most appropriate given the
current state of knowledge can be used. For example, because 14-3-3 proteins are encoded by several
genes, and the protein products of these genes are not always distinguishable in experimental assays,
14-3-3 proteins are represented by the class PR:000003237 that encompasses the protein products of all
14-3-3 family genes. Similarly, when the protein is known to be the product of a particular gene, but
no isoform information is available, a gene-level PRO term that encompasses all protein products of a
gene is used (e.g., TP73 (PR:O15350)). Integration of PRO terms into a multi-relation network context
further allows identification of proteoforms sharing common PTM enzymes (e.g., AKT (PR:000029189)) or
interacting partners (e.g., 14-3-3 (PR:000003237)) or implicated in the same diseases.

B. Cross-Species Comparison of Proteoforms of the Glioma-Associated DNA Methylase, DNMT1
    DNMT1 phosphorylation on Ser-127 or Ser-127/Ser-143 and the concomitant reduction in binding to
UHRF1 and PCNA (Fig. 1) have been associated with glioma in humans [14]. Because the mouse is often
used as a model system for studying glioma, we investigated whether the glioma-associated DNMT1
proteoforms are conserved in mouse as well as in several other mammals (Fig. 2). PRO representation of
proteoforms enables the cross-species comparison of PTM at the PTM-form level, which is more likely to
reflect functional conservation than comparisons of the individual sites alone. While human Ser-143 is
conserved across all species, Ser-127 is found only in other primates (red/pink
residues). Thus, neither the Ser-127 phosphorylated form nor the Ser-127/Ser-143 phosphorylated form of
DNMT1 (Fig. 1) is strictly conserved in mouse, rat, dog, or cow, suggesting that non-primates might use
a different mechanism for regulating DNMT1 interaction with PCNA and UHRF1. However, mouse, rat, and
dog (but not cow) have a serine at the adjacent position (blue/light blue residues), which could
potentially fulfill the same role as human Ser-127. This serine has been shown to be phosphorylated in
mouse in a high-throughput phospho-proteomic study [15]. Further studies are needed to clarify the role
of DNMT1 phosphorylation in a mouse glioma model system.

  Fig. 1. Multi-relation network showing partial text mining results for disease-associated
  phosphorylation-dependent protein-protein interactions. PRO terms for proteoforms and complexes and
  Disease Ontology terms for diseases are indicated.

  Fig. 2. Partial sequence alignment of the DNMT1 DNA methylase from several mammalian species showing
  degree of conservation of the human phosphorylation sites Ser-127 and Ser-143.

C. Analysis of Cancer-Associated Mutations in beta-catenin Phosphorylation Sites
    Beta-catenin is a multi-functional protein involved in cell-cell adhesion and transcriptional
regulation [16]. Several of its key transcriptional targets drive cell proliferation, and excessive
beta-catenin transcriptional activity is oncogenic. Beta-catenin stability is regulated by
phosphorylation of four residues in the N-terminus. Casein kinase I phosphorylates Ser-45 of
beta-catenin, which promotes the sequential phosphorylation of Thr-41, Ser-37, and Ser-33 by GSK3-beta.
Phosphorylation at Ser-37 and Ser-33 enables recognition of beta-catenin by the ubiquitin ligase
beta-TrCP, which targets it for degradation by the proteasome. Mutations in the four phosphorylation
sites stabilize the protein and have been associated with cancer.

    Through text mining with RLIMS-P and eFIP, we defined four beta-catenin proteoforms phosphorylated
at different combinations of the N-terminal sites, with distinct sub-cellular localizations, binding
partners, and activities (Fig. 3A). To gain insight into the role of these proteoforms in cancer, we
used data from COSMIC on cancer-associated mutations in these sites to perform hierarchical clustering
of different cancer types (Fig. 3B). The cancers fall into two major clusters with different mutation
patterns. Cluster 1 cancers (pink box) are characterized by mutations at Ser-33 and Ser-37, with few
mutations at Ser-45. Conversely, Cluster 2 cancers (blue box) are predominantly mutated at Ser-45. Both
clusters show intermediate levels of Thr-41 mutations. The Ser-33/Ser-37 mutation pattern in Cluster 1
suggests that oncogenesis in these cancer types is related to the lack of proteoforms 1 and 2. Both of
these forms are unstable due to their association with
the ubiquitin ligase beta-TrCP. Thus, beta-catenin stabilization may be playing an important role in
these cancers. Kinases, such as HIPK2, that can phosphorylate Ser-33 and Ser-37 without prior
phosphorylation of Ser-45 may be regulating beta-catenin stability in these tissues [17]. Cluster 2
cancers have relatively few mutations in the residues that bind beta-TrCP; instead, these cancers are
associated with lack of Ser-45 phosphorylated proteoform 3. Unlike other beta-catenin proteoforms,
proteoform 3 is found in the nucleus and may be a key transcriptionally active form of beta-catenin
[18]; this proteoform can also bind to the adhesion molecule E-cadherin [19]. Thus, alterations in
beta-catenin transcriptional and cell adhesion activity independent of beta-catenin levels may
contribute to Cluster 2 cancers. This example highlights the value of integrating experimental disease
information from bioinformatic resources such as COSMIC with PRO representation of proteoforms to gain
new insight into disease.

  Fig. 3. (A) Proteoforms of beta-catenin phosphorylated on several combinations of the N-terminal
  phosphorylation sites Ser-33, Ser-37, Thr-41, and Ser-45 with partial functional annotation. (B)
  Hierarchical clustering of cancer types based on their pattern of mutations in these phosphorylation
  sites.

             IV. CONCLUSIONS AND FUTURE WORK
    Through the structured representation of proteoforms and complexes, PRO facilitates: (i)
representation of proteoform-disease relations identified by large-scale text mining; (ii)
cross-species comparisons at the proteoform level for evaluation of the relevance of animal models of
disease; and (iii) interpretation of disease-associated mutation patterns. Currently, the PRO terms
curated in this project can be viewed on the PRO website by biologists interested in PTM, PPI, and
disease relationships. We are working toward formalizing these relationships and disseminating them in
standard semantic web format (e.g., RDF/XML) to enable computational reasoning and hypothesis
generation.

                              REFERENCES
[1]  L.M. Graves, J.S. Duncan, M.C. Whittle, and G.L. Johnson, "The dynamic nature of the kinome,"
     Biochem J, vol. 450, pp. 1-8, 2013.
[2]  B.M. Kessler, "Ubiquitin - omics reveals novel networks and associations with human disease,"
     Curr Opin Chem Biol, vol. 17, pp. 59-65, 2013.
[3]  X. Yuan, et al., "An online literature mining tool for protein phosphorylation," Bioinformatics,
     vol. 22, pp. 1668-1669, 2006.
[4]  C.O. Tudor, et al., "The eFIP system for text mining of protein interaction networks of
     phosphorylated proteins," Database (Oxford), vol. 2012, pp. bas044, 2012.
[5]  D.A. Natale, et al., "Protein Ontology: a controlled structured network of protein entities,"
     Nucleic Acids Res, vol. 42, pp. D415-421, 2014.
[6]  L.M. Smith, N.L. Kelleher, and Consortium for Top Down Proteomics, "Proteoform: a single term
     describing protein complexity," Nat Methods, vol. 10, pp. 186-187, 2013.
[7]  R. Malik, E.A. Nigg, and R. Korner, "Comparative conservation analysis of the human mitotic
     phosphoproteome," Bioinformatics, vol. 24, pp. 1426-1432, 2008.
[8]  C.H. Wei and H.Y. Kao, "Cross-species gene normalization by species inference," BMC
     Bioinformatics, vol. 12 Suppl 8, pp. S5, 2011.
[9]  K.E. Ross, et al., "Construction of protein phosphorylation networks by data mining, text mining
     and ontology integration: analysis of the spindle checkpoint," Database (Oxford), vol. 2013, pp.
     bat038, 2013.
[10] M.E. Smoot, et al., "Cytoscape 2.8: new features for data integration and network visualization,"
     Bioinformatics, vol. 27, pp. 431-432, 2011.
[11] F. Sievers and D.G. Higgins, "Clustal Omega, accurate alignment of very large numbers of
     sequences," Methods Mol Biol, vol. 1079, pp. 105-116, 2014.
[12] A.M. Waterhouse, et al., "Jalview Version 2--a multiple sequence alignment editor and analysis
     workbench," Bioinformatics, vol. 25, pp. 1189-1191, 2009.
[13] S.A. Forbes, et al., "COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations
     in Cancer," Nucleic Acids Res, vol. 39, pp. D945-950, 2011.
[14] E. Hervouet, et al., "Disruption of Dnmt1/PCNA/UHRF1 interactions promotes tumorigenesis from
     human and mice glial cells," PLoS One, vol. 5, pp. e11333, 2010.
[15] M. Trost, et al., "The phagosomal proteome in interferon-gamma-activated macrophages," Immunity,
     vol. 30, pp. 143-154, 2009.
[16] T. Valenta, G. Hausmann, and K. Basler, "The many faces and functions of beta-catenin," EMBO J,
     vol. 31, pp. 2714-2736, 2012.
[17] E.A. Kim, et al., "Homeodomain-interacting protein kinase 2 (HIPK2) targets beta-catenin for
     phosphorylation and proteasomal degradation," Biochem Biophys Res Commun, vol. 394, pp. 966-971,
     2010.
[18] M.T. Maher, et al., "Beta-catenin phosphorylated at serine 45 is spatially uncoupled from
     beta-catenin phosphorylated in the GSK3 domain: implications for signaling," PLoS One, vol. 5, pp.
     e10184, 2010.
[19] M.C. Faux, et al., "Independent interactions of phosphorylated beta-catenin with E-cadherin at
     cell-cell contacts and APC at cell protrusions," PLoS One, vol. 5, pp. e14127, 2010.
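
    The per-site proportion calculation and clustering described in the Methods can be sketched as
follows. This is a minimal Python sketch, not the R code used for Fig. 3B: the tissue names and
mutation counts are hypothetical placeholders (not COSMIC data), the function names are ours, and the
choice of Euclidean distance with complete linkage assumes the documented defaults of `dist` and
`hclust`, which `heatmap.2` is understood to call when run with default parameters.

```python
from itertools import combinations
from math import dist

SITES = ("Ser-33", "Ser-37", "Thr-41", "Ser-45")
MIN_MUTATIONS = 10  # inclusion threshold stated in the Methods

# Hypothetical mutation counts per site, in SITES order. NOT COSMIC data.
counts = {
    "tissue_A": (40, 35, 20, 5),
    "tissue_B": (30, 30, 25, 15),
    "tissue_C": (5, 5, 30, 60),
    "tissue_D": (2, 3, 10, 35),
    "tissue_E": (1, 2, 3, 1),  # only 7 mutations in total, so it is excluded
}


def site_proportions(counts, min_total=MIN_MUTATIONS):
    """Per-site proportion of mutations for tissues with enough mutations."""
    props = {}
    for tissue, row in counts.items():
        total = sum(row)
        if total >= min_total:
            props[tissue] = tuple(n / total for n in row)
    return props


def complete_linkage(props, n_clusters=2):
    """Agglomerative clustering: Euclidean distance, complete linkage."""
    clusters = [[tissue] for tissue in sorted(props)]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(props[x], props[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


props = site_proportions(counts)
clusters = complete_linkage(props)
print(clusters)
```

    With these made-up counts, the tissues dominated by Ser-33/Ser-37 mutations and those dominated by
Ser-45 mutations fall into separate clusters, which is the kind of split reported for Fig. 3B.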




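
    The distinction drawn in Section III.B between conservation of individual sites and conservation
of a whole PTM form can be illustrated with a short sketch. It assumes a gapped multiple sequence
alignment is already available as strings (for example, exported from an alignment tool). The
sequences below are invented toy strings, not DNMT1 sequences, and the function names and species
labels are hypothetical. Only residue identity at the aligned column is tested; whether a residue is
actually phosphorylated in the other species is outside the scope of this check.

```python
def column_of(aligned_seq, position):
    """Return the 0-based alignment column of a 1-based ungapped residue position."""
    seen = 0
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"position {position} is beyond the end of the sequence")


def site_conservation(alignment, reference, sites):
    """For each non-reference species, report whether each reference site is conserved."""
    ref_seq = alignment[reference]
    columns = {pos: column_of(ref_seq, pos) for pos in sites}
    return {
        species: {pos: seq[col] == ref_seq[col] for pos, col in columns.items()}
        for species, seq in alignment.items()
        if species != reference
    }


def proteoform_conserved(per_site):
    """A multi-site form counts as conserved only if every one of its sites is."""
    return all(per_site.values())


# Toy alignment (hypothetical): the reference has serines at ungapped positions 3 and 6.
alignment = {
    "reference": "AKSP-TSR",
    "species_x": "AKSPGTSR",  # both serines present
    "species_y": "AKAS-TSR",  # first serine replaced; a serine sits at the adjacent column
}

report = site_conservation(alignment, "reference", sites=(3, 6))
for species, per_site in report.items():
    print(species, per_site, "form conserved:", proteoform_conserved(per_site))
```

    In this toy case species_y conserves one of the two sites, so a site-level comparison would count
it as partly conserved, while the doubly modified form as a whole is not conserved.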