ICBO 2014 Proceedings

Representing Modification Sites in PRO

Jonathan P. Bona, University at Buffalo, Buffalo, NY. Email: jpbona@buffalo.edu
Jenny Rouleau, Canisius College, Buffalo, NY. Email: rouleauj@canisius.edu
Alan Ruttenberg, University at Buffalo, Buffalo, NY. Email: alanrutt@buffalo.edu

Abstract: This paper presents a model for explicit representations of amino acid sites in the Protein Ontology. We handle sites that are the locations of post-translational modifications, focusing on histone proteins as our initial test case. The work explicitly represents both the entities involved (sites, residues, etc.) and commonly used information about those entities, such as positions relative to a reference sequence.

I. INTRODUCTION

The Protein Ontology [1] (PRO) "... provides an ontological representation of protein-related entities by explicitly defining them and showing the relationships between them."¹ Representable entities involved in a posttranslational modification (PTM) include: the protein itself, the amino acid residue(s), the modification process, the modifying enzymes, chemical groups, and the location at which the modification takes place.

One type of protein-related entity that will benefit from more explicit representations than are currently used is site, or location. Examples include cleavage sites, domain binding sites, sites in secondary structures, mutation sites, and modification sites.

This work focuses on representing sites of posttranslational modifications. Specifically, we deal with sites that contain amino acids that undergo chemical modifications, e.g. phosphorylation, acetylation, and so on. The approach can be generalized to represent many types of relevant sites and their relationships to the proteins that host them, though we focus initially on modifications that involve change to a single amino acid.

In addition to terms and relations about the biological entities involved, we also represent information that is typically included in existing descriptions of these entities, such as the numeric position of a residue on a reference sequence.

II. EXISTING REPRESENTATIONS OF MODIFICATIONS & SITES

Existing resources that represent information about protein modifications do not generally make explicit representations of all of the entities involved.

For instance, the Histone Infobase and other sources use a string of characters like "H3T3ph"² to name a modification, which is then further described in natural language text, along with PubMed IDs for provenance. The string "H3T3ph" is intended to succinctly express the fact that H3 histones can be modified by having the amino acid residue at position 3, which is Threonine, become phosphorylated. In this case, the information about the modification is linked to a page with information about the kinase responsible for carrying out the phosphorylation. Also in this case, the modification represented has been observed to occur in some, but not all, histone H3 variants. However, the database does not appear to include a term, page, or other explicit representation of the entity that is the Histone H3.3 variant modified in this way.

This way of representing information about PTMs makes it difficult to use data from this source in combination with data from other sources without the intervention of a human being familiar with the domain.

A further complication is that even those sources that consistently name locations using the amino acid residue present there, along with a number indicating the distance from the N-terminus (in units of one amino acid), often have inconsistencies due to different treatment of the N-terminal initiator methionine (at position 1), which is usually removed. As an example, consider the UniProtKB [2] accession P84243³, which is one of the H3 variants that undergoes the PTM above called H3T3ph. This modification is better treated in UniProt's interface than in HIstome, but UniProt includes the initiator methionine in its position indices, which makes the representations used by the two resources inconsistent.

A. Example Competency Questions

The following are examples of the sort of questions that it would be useful to ask about protein modifications. Representations of modifications and the entities involved should allow these questions to be answered computationally, e.g. with the use of SPARQL queries:

• Given a protein, what PTMs does it have: what are the original residue type, the modified residue type, and the position with reference to the UniProt sequence?
• What proteins in family X are known to have phosphorylated sites?
• Given a protein modification site, what other modifications of the site are known?
• Which types of protein have different functions conferred by being acetylated?
• What PTMs are conserved across species?

A major motivation for this work is to craft well-structured and accurate representations in OWL of the entities involved in PTMs that will facilitate inferring and retrieving the answers to such questions.

B. Existing Representations in PRO

Table I shows the information already existing in PRO for a particular modified H3.3 histone protein (http://purl.obolibrary.org/obo/PR_000036802).

TABLE I. PR:000036802

  PRO ID                     PR:000036802
  PRO Name                   histone H3.3 acetylated and methylated 2 (mouse)
  Synonyms                   mH3F3B/AcMeth:2 (EXACT) PRO-short-label
  Definition                 A histone H3.3 that has been acetylated on the N-terminal Met and methylated on several Lys residues in mouse. UniProtKB:P84244, Met-1, MOD:00058|Lys-10/Lys-28, MOD:00083|Lys-19, MOD:00085. [TDR:PFR9332]
  Comment                    Category=organism-modification. Note=Top down proteomics.
  Hierarchical relationship  Parent: PR:000008425 histone H3.3
                             Children: none
                             only in taxon NCBITaxon:10090 Mus musculus

By reading the name, text definition, and other fields, and by following links to UniProt or PSI-MOD, a reader familiar with the subject matter comes to understand that the protein that this term stands for has three modifications to lysine residues, which are located at numeric positions 10, 19, and 28 relative to the N-terminus of a reference sequence. The counting scheme that produces these particular numeric positions seems to include the N-terminal Methionine, which is also described as being modified (acetylated). The linked UniProt page for this protein's reference sequence describes the initiator Methionine as "removed." The PSI-MOD IDs embedded in the definition text indicate more details about the nature of the modifications: MOD:00085⁴ is defined as "A protein modification that effectively converts an L-lysine residue to N6-methyl-L-lysine", while MOD:00083⁵ converts the L-lysine residues at positions 10 and 28 to N6,N6,N6-trimethyl-L-lysine.

III. HISTONE MODIFICATION TEST CASE

This work uses PTMs in histone proteins as a representation test case. Chromatin in the nucleus of eukaryotic cells is composed of histones together with DNA. Posttranslational modifications to the histones change the structure of the chromatin and are hypothesized to form a "code" of downstream effects, including changes to the transcription of DNA [3].

Because histone modifications are of particular significance, there is increasing interest in discovering and cataloging the possible modifications, their combinations, and their functions. There are relatively few histone types, even including known variants. The situation is thus that a few proteins play a central role in the nucleus of eukaryotic cells, modifications to them are believed to form a complex code affecting the cell's behavior, and there are ongoing efforts to collect ever more data about them. By developing correct, precise, and computable representation schemes for these facts, and using those to represent existing knowledge about histone modifications, as well as knowledge that is generated by new assays, we aim to facilitate sharing and use of that data, and discovery of unknown facts it entails.

We have collected a preliminary histone modification data set from HIstome: The Histone Infobase⁶. This data includes information on all five human histone types and fifty-five variants thereof, with one hundred and six observed modifications, as well as disease associations. The modification types included are arginine citrullination and methylation; lysine acetylation, biotinylation, methylation, ribosylation, and ubiquitination; and phosphorylation of serine, threonine, and tyrosine. HIstome PTM records specify the histone variant involved, the amino acid residues, and a location given as the position in the amino acid sequence that makes up the primary structure of the protein. They are also linked to UniProtKB accessions for the proteins involved.

IV. REPRESENTATION OF SITES

This section describes in detail our approach to representing modification sites. While the basic scheme is in place, the work continues to evolve as we are now adding representations of many different histone protein modifications based on data from the Histone Infobase site, and on data gathered by top-down proteomics.

Our work in progress on representing protein, and specifically histone, modifications can be followed at: http://ctde.net/page/Protein_Modifications. The draft OWL document and related resources can be viewed from that page, or accessed directly on the pro-ontology Google Code repository at https://code.google.com/p/pro-ontology/source/browse/trunk/src/ontology/protein-sites/protein-site.owl

A. Site Classes

We represent PTM locations as subclasses of the Basic Formal Ontology [4][5] term bfo:site⁷, which is elucidated as: "b is a site means: b is a three-dimensional immaterial entity that is (partially or wholly) bounded by a material entity or it is a three-dimensional immaterial part thereof." (axiom label in BFO2 Reference: [034-002]).

We reify PTM locations using the classes

• 'amino acid chain site',
• 'site of an amino acid residue in a protein', and
• 'site of post translationally modified amino acid residue'.

Figure 1 shows the subclass relations between these and their superclass, bfo:site.

Fig. 1. Types of Sites

A protein is made up of a chain of amino acid residues. Each amino acid residue occupies (bfo:has_location) a site. The PRO term 'protein'⁸ is a subclass of 'amino acid chain', defined as "A molecular entity that is a polymer of amino acids linked by peptide bonds" [PRO:DAN].

Many but not all amino acid chains are part of proteins. Every instance of 'site of an amino acid residue in a protein', and of its subclass 'site of post translationally modified amino acid residue', is part_of some protein.

Figure 1 also shows the term 'mouse histone H3.3 site', which we have defined as "The site of an amino acid residue in a mouse histone H3.3 protein". Each instance of 'mouse histone H3.3 site' is part_of some 'amino acid chain'⁹ and is the location of some 'amino acid residue'¹⁰.

Fig. 2. Mouse Histone H3.3 Sites

The subclasses of 'mouse histone H3.3 site', shown in Figure 2, represent specific sites. For example, 'mouse histone H3.3 Lys-19 site' is a specific site. The label used for the term is mnemonic, to suggest to a human reader that this is a site on a mouse histone H3.3, that it contains a lysine residue, and that it is in a particular location with respect to the N-terminus. However, the label itself should not be taken as a representation of these facts.

B. Residues

Recall from Table I and the earlier discussion of PR:000036802 ('histone H3.3 acetylated and methylated 2 (mouse)') that PSI-MOD IDs are currently attached to the entry for that protein, and that the PSI-MOD IDs seem to denote the processes of modification rather than, say, the resulting residues. MOD:00085, for instance, is defined as "A protein modification that effectively converts an L-lysine residue to N6-methyl-L-lysine". This text definition makes reference to the residue that results from this type of modification, but MOD:00085 itself stands for the modification, not the modified residue.

We are using RESID terms for amino acid residues, as shown in Figure 3. We have defined 'modified residue', which is a subclass of the CHEBI [6] term 'amino acid residue'. Subclasses of 'modified residue' are amino acid residues that are the output of some modification process.

A MOD:00085 (process) involves the conversion of an L-lysine residue (RESID:AA0012¹¹), which is located at some 'site of an amino acid residue in a protein', into an N6-methyl-L-lysine residue (RESID:AA0076¹²) that is located at some 'site of post translationally modified amino acid residue'.

These facts are currently given in PSI-MOD as part of text definitions (e.g. "A protein modification that effectively converts an L-lysine residue to N6-methyl-L-lysine"¹³) and as unstructured xref_definitions: RESID:AA0076, the modification's output, is listed among five other xref_definitions in the PSI-MOD entry for MOD:00085.

Fig. 3. Residues

Any instance of 'site of an amino acid residue in a protein' is occupied_by some 'amino acid residue'. Any instance of 'site of post translationally modified amino acid residue' is occupied_by some 'modified residue'.

C. Modified Proteins

Figure 4 shows the term for the modified protein PR:000036802.

Fig. 4. Histone H3.3 acetylated and methylated 2 (mouse)

Figure 5 shows part of the OWL definition for this term using the representation discussed above. 'histone H3.3 acetylated and methylated 2 (mouse)' is that particular modified protein, which is a subclass of 'histone (mouse)' that has four sites of modified residue. One of those sites is a 'mouse histone H3.3 Lys-10 site' that is occupied_by some N6,N6,N6-trimethyl-L-lysine residue¹⁴.

Fig. 5. Definition of Histone H3.3 acetylated and methylated 2 (mouse)

D. Position within a Reference Sequence

The position within a reference sequence of a particular amino acid residue / modification site is a key piece of information currently used to identify such sites in many different resources. We represent such descriptions as Information Artifact Ontology (IAO)¹⁵ information content entities. For instance, the UniProt reference sequence P84244¹⁶ corresponds to 'histone H3.3 (mouse)'. We use the class 'site location within reference sequence of uniprot P84244' for positions relative to that sequence. Members of that class are named individuals like 'position 19 in reference sequence of mouse histone H3.3', which is about a specific mouse histone H3.3 site, namely 'mouse histone H3.3 Lys-19 site'.

Fig. 6. Information entities about sites

V. EXAMPLE QUERY

The following simple SPARQL query gets the subclasses of 'histone H3.3 (mouse)' (h33mouse) and their sites that are subclasses of 'site of an amino acid residue in a protein' (ressite)¹⁷. That is, it answers the question: what are the variants of histone H3.3 (mouse) and the sites of their PTMs?

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX h33mouse: <...>
    PREFIX ressite: <http://purl.obolibrary.org/obo/PROXXX_0001001>

    SELECT ?var ?s
    WHERE {
      ?var rdfs:subClassOf+ h33mouse: .
      ?var rdfs:subClassOf ?d .
      ?d owl:onClass/owl:intersectionOf/rdf:first ?s .
      ?s rdfs:subClassOf+ ressite: .
    }

The results of this query are shown in Figure 7. Because the ontology document queried contains only the example modified form discussed above ('histone H3.3 acetylated and methylated 2 (mouse)'), the set of results includes only its modification sites.

Fig. 7. SPARQL results: variants and modification sites

VI. CONCLUSIONS AND FUTURE WORK

We have outlined and implemented a representation of posttranslational modifications that explicitly accounts for the biological entities involved and information about them. This includes representations of sites, residues, and modified variants, as well as of information artifacts such as descriptions of sites relative to reference sequences.

This representation scheme is ready to be applied to a larger set of test data, some of which is not currently present in PRO and some of which is currently present in a less structured form. Immediate future work involves translating histone modification data from the Histone Infobase and from top-down proteomics, integrating those data with the relevant records in PRO, and creating new records where needed. Much of the work can be scripted, but some will require manual curation as well.

Other planned future work is to expand the scope of our representation of sites to include other types of modifications, mutation sites, cleavage sites, and so on. Though the basic scheme will remain the same (using terms that represent each type of site along with terms for related entities), each will present opportunities for interesting ontology work. For instance, mutations that involve the deletion of an amino acid residue raise non-trivial questions about what happens to the site when the residue that occupied it is gone.

REFERENCES

[1] D. A. Natale, C. N. Arighi, W. C. Barker, J. A. Blake, C. J. Bult, M. Caudy, H. J. Drabkin, P. D'Eustachio, A. V. Evsikov, H. Huang et al., "The protein ontology: a structured representation of protein forms and complexes," Nucleic Acids Research, vol. 39, no. suppl 1, pp. D539–D545, 2011.
[2] The UniProt Consortium, "The universal protein resource (UniProt)," Nucleic Acids Research, vol. 36, no. suppl 1, pp. D190–D195, 2008. [Online]. Available: http://nar.oxfordjournals.org/content/36/suppl_1/D190.abstract
[3] B. D. Strahl and C. D. Allis, "The language of covalent histone modifications," Nature, vol. 403, no. 6765, pp. 41–45, Jan. 2000. [Online]. Available: http://dx.doi.org/10.1038/47412
[4] P. Grenon and B. Smith, "SNAP and SPAN: Towards dynamic spatial ontology," Spatial Cognition and Computation, vol. 4, no. 1, pp. 69–104, 2004.
[5] P. Grenon, B. Smith, and L. Goldberg, "Biodynamic Ontology: Applying BFO in the Biomedical Domain," D. Pisanelli, Ed. IOS Press, 2004, vol. 102, pp. 20–38.
[6] K. Degtyarenko, P. De Matos, M. Ennis, J. Hastings, M. Zbinden, A. McNaught, R. Alcántara, M. Darsow, M. Guedj, and M. Ashburner, "ChEBI: a database and ontology for chemical entities of biological interest," Nucleic Acids Research, vol. 36, no. suppl 1, pp. D344–D350, 2008.

¹ http://pir.georgetown.edu/pro/pro.shtml
² http://www.actrec.gov.in/histome/ptm_sp.php?ptm_sp=H3T3ph
³ http://www.uniprot.org/uniprot/P84243
⁴ http://www.ebi.ac.uk/ontology-lookup/?termId=MOD:00085
⁵ http://www.ebi.ac.uk/ontology-lookup/?termId=MOD:00083
⁶ http://www.actrec.gov.in/histome/index.php
⁷ http://purl.obolibrary.org/obo/BFO_0000029
⁸ http://purl.obolibrary.org/obo/PR_000000001
⁹ http://purl.obolibrary.org/obo/PR_000018263
¹⁰ http://purl.obolibrary.org/obo/CHEBI_33708
¹¹ http://pir.georgetown.edu/cgi-bin/resid?id=AA0012
¹² http://pir.georgetown.edu/cgi-bin/resid?id=AA0076
¹³ http://www.ebi.ac.uk/ontology-lookup/?termId=MOD:00085
¹⁴ http://pir.georgetown.edu/cgi-bin/resid?id=AA0074
¹⁵ https://code.google.com/p/information-artifact-ontology/
¹⁶ http://www.uniprot.org/uniprot/P84244
¹⁷ The URI used here for 'site of an amino acid residue in a protein' (http://purl.obolibrary.org/obo/PROXXX_0001001) is a temporary placeholder to be replaced by a real PRO ID when it is added to a PRO release.
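The numbering inconsistency described in Section II (a name such as "H3T3ph" counts positions without the initiator methionine, while UniProt counts that methionine as position 1) can be illustrated with a minimal sketch. It is not part of the representation scheme itself. The names PtmName, parse_ptm_name, to_met_inclusive, and to_met_exclusive are hypothetical, the pattern is an assumption that covers only simple names of the "H3T3ph" form, and the offset of one assumes that treatment of the initiator methionine is the only difference between the two schemes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PtmName:
    """Parts of a compact PTM name such as "H3T3ph"."""
    histone: str        # histone type, e.g. "H3"
    residue: str        # one-letter amino acid code, e.g. "T"
    position: int       # position exactly as written in the name
    modification: str   # lowercase suffix, e.g. "ph"


# Assumed (hypothetical) pattern: histone type, one-letter residue code,
# numeric position, lowercase modification suffix with optional digits.
_PTM_NAME = re.compile(
    r"^(?P<histone>H\d[A-Z]?)(?P<residue>[A-Z])(?P<position>\d+)(?P<mod>[a-z]+\d*)$"
)


def parse_ptm_name(name: str) -> PtmName:
    """Split a compact PTM name into its parts, or raise ValueError."""
    match = _PTM_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized PTM name: {name!r}")
    return PtmName(
        histone=match["histone"],
        residue=match["residue"],
        position=int(match["position"]),
        modification=match["mod"],
    )


def to_met_inclusive(position: int) -> int:
    """Convert a position that skips the initiator Met to one that counts it as 1."""
    if position < 1:
        raise ValueError("positions start at 1")
    return position + 1


def to_met_exclusive(position: int) -> int:
    """Convert a position that counts the initiator Met as 1 to one that skips it."""
    if position < 2:
        raise ValueError("the initiator Met has no position in the Met-exclusive scheme")
    return position - 1


if __name__ == "__main__":
    ptm = parse_ptm_name("H3T3ph")
    print(ptm, "-> Met-inclusive position", to_met_inclusive(ptm.position))
```

Under these assumptions, the threonine named by "H3T3ph" falls at position 4 when the initiator methionine is counted, and a position such as Lys-10 in Table I would be written as position 9 in the methionine-exclusive scheme. The explicit site and reference-sequence position terms of Section IV are intended to make this kind of ad hoc conversion unnecessary.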