<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1038/47412</article-id>
      <title-group>
        <article-title>Representing Modification Sites in PRO</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonathan P. Bona</string-name>
          <email>jpbona@buffalo.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jenny Rouleau</string-name>
          <email>rouleauj@canisius.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan Ruttenberg</string-name>
          <email>alanrutt@buffalo.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Canisius College</institution>
          ,
          <addr-line>Buffalo, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University at Buffalo</institution>
          ,
          <addr-line>Buffalo, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>39</volume>
      <issue>6765</issue>
      <fpage>41</fpage>
      <lpage>45</lpage>
      <abstract>
        <p>This paper presents a model for explicitly representing amino acid sites in the Protein Ontology (PRO). We handle sites that are the locations of post-translational modifications, focusing on histone proteins as our initial test case. The work explicitly represents both the entities involved (sites, residues, etc.) and commonly used information about those entities, such as positions relative to a reference sequence.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>The Protein Ontology [1] (PRO) “... provides an
ontological representation of protein-related entities by explicitly
defining them and showing the relationships between them.”<sup>1</sup></p>
      <sec id="sec-1-1">
        <p>Representable entities involved in a posttranslational
modification (PTM) include: the protein itself, the amino acid
residue(s), the modification process, the modifying enzymes,
chemical groups, and the location at which the modification
takes place.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <p>One type of protein-related entity that will benefit from
more explicit representations than are currently used is site,
or location. Examples include cleavage sites, domain binding
sites, sites in secondary structures, mutation sites, and
modification sites.</p>
      <p>This work focuses on representing sites of posttranslational
modifications. Specifically, we deal with sites that contain
amino acids that undergo chemical modifications, e.g.
phosphorylation, acetylation, and so on. The approach can be
generalized to represent many types of relevant sites and their
relationships to the proteins that host them, though we focus
initially on modifications that involve a change to a single amino
acid.</p>
    </sec>
    <sec id="sec-3">
      <p>In addition to terms and relations about the biological entities involved, we also represent information that is typically included in existing descriptions of these entities, such as the numeric position of a residue on a reference sequence.</p>
    </sec>
    <sec id="sec-4">
      <p>Existing resources that represent information about protein
modifications generally do not explicitly represent
all of the entities involved.</p>
      <p>For instance, the Histone Infobase and other sources use a
string of characters like “H3T3ph”<sup>2</sup> to name a modification,
which is then further described in natural language text, along
with PubMed IDs for provenance. The string “H3T3ph” is
intended to succinctly express the fact that H3 histones can
be modified by having the amino acid residue at position 3,
which is threonine, become phosphorylated. In this case, the
information about the modification is linked to a page with
information about the kinase responsible for carrying out the
phosphorylation. Also in this case, the modification represented
has been observed to occur in some, but not all, histone H3
variants. However, the database does not appear to include a
term, page, or other explicit representation of the entity that is
the histone H3.3 variant modified in this way.</p>
      <p><sup>1</sup> http://pir.georgetown.edu/pro/pro.shtml</p>
      <p><sup>2</sup> http://www.actrec.gov.in/histome/ptm_sp.php?ptm_sp=H3T3ph</p>
      <p>This way of representing information about PTMs makes
it difficult to use data from this source in combination with
data from other sources without the intervention of a human
being familiar with the domain. A further complication is
that even those sources that consistently name locations using
the amino acid residue present there, along with a number
indicating its distance from the N-terminus (in units of one amino acid), often
have inconsistencies due to different treatment of the
N-terminal initiator methionine (at position 1), which is usually
removed.</p>
      <sec id="sec-4-1">
        <p>As an example, consider the UniProtKB [2] accession
P84243<sup>3</sup>, which is one of the H3 variants that undergoes the
PTM above called H3T3ph. This modification is better treated in UniProt’s interface than in Histome, but UniProt includes the initiator methionine in its position indices, which makes the representations used by the two resources inconsistent.</p>
      </sec>
    </sec>
    <sec id="sec-5">
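      <p>The numbering mismatch described above amounts to an offset of one position. The following Python sketch is illustrative only: the function names are hypothetical, and the short sequence string is an assumed N-terminal fragment of histone H3 rather than data taken from either resource.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Illustrative N-terminal residues of histone H3, with the initiator
# methionine included. This string is an assumption made for the sketch;
# the UniProt record is the authoritative source for the sequence.
H3_N_TERMINUS_WITH_MET = "MARTKQTARKSTGGKAPRKQLATKAARK"


def to_met_included(position_without_met):
    """Convert a 1-based position that does not count the initiator
    methionine (as in the name H3T3ph) to one that does count it."""
    if position_without_met < 1:
        raise ValueError("positions are 1-based")
    return position_without_met + 1


def to_met_removed(position_with_met):
    """Convert a 1-based position that counts the initiator methionine
    to one that does not. Position 1 is the methionine itself."""
    if position_with_met < 2:
        raise ValueError("the initiator methionine has no Met-removed position")
    return position_with_met - 1


def residue_at(sequence_with_met, position_with_met):
    """Return the one-letter code at a 1-based, Met-included position."""
    if not 1 <= position_with_met <= len(sequence_with_met):
        raise ValueError("position outside the sequence")
    return sequence_with_met[position_with_met - 1]


# H3T3ph names threonine at position 3 when the methionine is not counted.
met_included = to_met_included(3)
print(met_included, residue_at(H3_N_TERMINUS_WITH_MET, met_included))
```
]]></preformat>
      <p>Under these assumptions, the threonine named by H3T3ph sits at position 4 when the initiator methionine is counted. A conversion like this only works if each source states which convention it uses, which is one reason for representing positions explicitly.</p>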
      <sec id="sec-5-1">
        <title>A. Example Competency Questions</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <p>The following are examples of the sort of questions that it
would be useful to ask about protein modifications.
Representations of modifications and the entities involved should allow
these questions to be answered computationally, e.g. with the
use of SPARQL queries:</p>
    </sec>
    <sec id="sec-7">
      <title>Given a protein, what PTMs does it have: what are the original residue type, the modified residue type, and the position relative to the UniProt reference sequence?</title>
    </sec>
    <sec id="sec-8">
      <title>What proteins in family X are known to have phosphorylated sites?</title>
    </sec>
    <sec id="sec-9">
      <title>Given a protein modification site, what other modifications of the site are known?</title>
    </sec>
    <sec id="sec-10">
      <title>Which types of protein have different functions conferred by being acetylated?</title>
    </sec>
    <sec id="sec-11">
      <title>What PTMs are conserved across species?</title>
    </sec>
    <sec id="sec-12">
      <p>A major motivation for this work is to craft well-structured
and accurate representations in OWL of the entities involved
in PTMs that will facilitate inferring and retrieving the answers
to such questions.</p>
      <p><sup>3</sup> http://www.uniprot.org/uniprot/P84243</p>
      <sec id="sec-12-1">
        <title>B. Existing Representations in PRO</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <p>By reading the name, text definition, and other fields,
and by following links to UniProt or PSI-MOD, a reader
familiar with the subject matter comes to understand that the
protein that this term stands for has three modifications to
lysine residues, which are located at numeric positions 10, 19,
and 28 relative to the N-terminus of a reference sequence.</p>
    </sec>
    <sec id="sec-14">
      <p>The counting scheme used to produce these particular numeric
positions seems to include the N-terminal methionine, which
is also described as being modified (acetylated).</p>
    </sec>
    <sec id="sec-15">
      <p>The linked UniProt page for this protein’s reference sequence describes
the initiator methionine as “removed.”</p>
      <sec id="sec-15-1">
        <p>The PSI-MOD IDs embedded in the definition text indicate more details about
the nature of the modifications: MOD:00085<sup>4</sup> is defined as
“A protein modification that effectively converts an L-lysine
residue to N6-methyl-L-lysine,” while MOD:00083<sup>5</sup> converts
the L-lysine residues at positions 10 and 28 to
N6,N6,N6-trimethyl-L-lysine.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <p>This work uses PTMs in histone proteins as a representation
test case. Chromatin in the nucleus of eukaryotic cells
is composed of histones together with DNA. Posttranslational
modifications to the histones change the structure of the
chromatin and are hypothesized to form a “code” of downstream
effects, including changes to the transcription of DNA [3].</p>
      <p>Because histone modifications are of particular
significance, there is increasing interest in discovering and
cataloging the possible modifications, their combinations, and
their functions. There are relatively few histone types, even
including known variants. The situation is thus that a small
number of proteins play a central role in the nucleus of
eukaryotic cells, modifications to them are believed to
form a complex code affecting the cell’s behavior, and
there are ongoing efforts to collect ever more data about them.
By developing correct, precise, and computable representation
schemes for these facts, and using those schemes to represent existing
knowledge about histone modifications as well as knowledge
that is generated by new assays, we aim to facilitate sharing
and use of that data, and discovery of unknown facts it entails.</p>
      <p><sup>4</sup> http://www.ebi.ac.uk/ontology-lookup/?termId=MOD:00085</p>
      <p><sup>5</sup> http://www.ebi.ac.uk/ontology-lookup/?termId=MOD:00083</p>
      <p>We have collected a preliminary histone modification data
set from HIstome: The Histone Infobase<sup>6</sup>. This data set includes
information on all five human histone types and fifty-five
variants thereof, with one hundred and six observed
modifications, as well as disease associations. The modification types
included are arginine citrullination and methylation; lysine
acetylation, biotinylation, methylation, ribosylation, and
ubiquitination; and phosphorylation of serine, threonine, and tyrosine.</p>
    </sec>
    <sec id="sec-17">
      <p>Histome PTM records specify the histone variant involved,
the amino acid residues, and a location given as the position in the
amino acid sequence that makes up the primary structure of
the protein. They are also linked to UniProtKB accessions for
the proteins involved.</p>
    </sec>
    <sec id="sec-18">
      <title>IV. REPRESENTATION OF SITES</title>
    </sec>
    <sec id="sec-19">
      <p>This section describes in detail our approach to representing
modification sites. While the basic scheme is in place, the
work continues to evolve as we are now adding representations
of many different histone protein modifications based on data
from the Histone Infobase site, and on data gathered by
top-down proteomics.</p>
      <p>Our work in progress on representing protein, and
specifically histone, modifications can be followed at
http://ctde.net/page/Protein_Modifications. The draft OWL document and
related resources can be viewed from that page, or
accessed directly on the pro-ontology Google Code
repository at
https://code.google.com/p/pro-ontology/source/browse/trunk/src/ontology/protein-sites/protein-site.owl</p>
      <sec id="sec-19-1">
        <title>A. Site Classes</title>
      </sec>
    </sec>
    <sec id="sec-20">
      <sec id="sec-20-1">
        <p>We represent PTM locations as subclasses of the Basic
Formal Ontology [4][5] term bfo:site<sup>7</sup>, which is elucidated
as: “b is a site means: b is a three-dimensional immaterial entity
that is (partially or wholly) bounded by a material entity or
it is a three-dimensional immaterial part thereof” (axiom label
in BFO2 Reference: [034-002]).</p>
      </sec>
    </sec>
    <sec id="sec-21">
      <p>We reify PTM locations using the classes
amino acid chain site,
site of an amino acid residue in a
protein, and
site of post translationally
modified amino acid residue.</p>
    </sec>
    <sec id="sec-22">
      <p>A protein is made up of a chain of amino acid residues.</p>
    </sec>
    <sec id="sec-23">
      <p>Each amino acid residue occupies (bfo:has_location) a
site. The PRO term protein<sup>8</sup> is a subclass of amino acid
chain, defined as “A molecular entity that is a polymer of
amino acids linked by peptide bonds” [PRO:DAN].</p>
    </sec>
    <sec id="sec-24">
      <p>Many but not all amino acid chains are part of
proteins. Every instance of site of an amino acid
residue in a protein, and of its subclass site of
post translationally modified amino acid
residue, is part_of some protein.</p>
      <p>Each instance of mouse histone H3.3 site is
part_of some amino acid chain<sup>9</sup> and is the location
of some amino acid residue<sup>10</sup>.</p>
      <p>The subclasses of mouse histone H3.3 site,
shown in Figure 2, represent specific sites. For example,
mouse histone H3.3 Lys-19 site is a specific site.</p>
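      <p>The site classes named in this section can be pictured as a small subclass hierarchy. The Python sketch below is a simplified illustration, not the OWL file itself. The text states only that PTM locations are subclasses of bfo:site and that site of post translationally modified amino acid residue is a subclass of site of an amino acid residue in a protein; the remaining parent links are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Direct superclass of each class label. Links marked "assumed" are not
# stated in the text and are included only to make the sketch complete.
SUPERCLASS = {
    "amino acid chain site": "bfo:site",
    # assumed: residue sites specialize chain sites
    "site of an amino acid residue in a protein": "amino acid chain site",
    "site of post translationally modified amino acid residue":
        "site of an amino acid residue in a protein",
    # assumed: placement of the mouse histone H3.3 site class
    "mouse histone H3.3 site": "site of an amino acid residue in a protein",
    "mouse histone H3.3 Lys-19 site": "mouse histone H3.3 site",
}


def ancestors(label):
    """Return the chain of superclasses of a class, nearest first."""
    chain = []
    while label in SUPERCLASS:
        label = SUPERCLASS[label]
        chain.append(label)
    return chain


def is_subclass_of(label, candidate):
    """True if candidate is the class itself or one of its superclasses."""
    return label == candidate or candidate in ancestors(label)


print(ancestors("mouse histone H3.3 Lys-19 site"))
```
]]></preformat>
      <p>Walking the chain from a specific site such as mouse histone H3.3 Lys-19 site up to bfo:site is the kind of inference an OWL reasoner performs over the actual class axioms.</p>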
    </sec>
    <sec id="sec-25">
      <p>The label used for the term is a mnemonic that suggests to a
human reader that this is a site on a mouse histone H3.3,
that it contains a lysine residue, and that it is in a particular
location with respect to the N-terminus. However, the label
itself should not be taken as a representation of these facts.</p>
      <sec id="sec-25-1">
        <title>B. Residues</title>
      </sec>
    </sec>
    <sec id="sec-26">
      <p>Recall from Table 1 and earlier discussion of
PR:000036802 (histone H3.3 acetylated and
methylated 2 (mouse)) that PSI-MOD IDs are
currently attached to the entry for that protein, and that the
PSI-MOD IDs seem to denote the processes of modification
rather than, say, the resulting residues. MOD:00085, for
instance, is defined as “A protein modification that effectively
converts an L-lysine residue to N6-methyl-L-lysine.” This text
definition makes reference to the residue that results from this
type of modification, but MOD:00085 itself stands for the
modification, not the modified residue.</p>
    </sec>
    <sec id="sec-27">
      <p>We are using RESID terms for amino acid residues, as
shown in Figure 3. We have defined modified residue,
which is a subclass of the CHEBI [6] term amino acid
residue. Subclasses of modified residue are amino
acid residues that are the output of some modification process.</p>
    </sec>
    <sec id="sec-28">
      <p>A MOD:00085 (process) involves the conversion of an
L-lysine residue (RESID:AA0012<sup>11</sup>), which is located
at some site of an amino acid residue in
a protein, into an N6-methyl-L-lysine residue
(RESID:AA0076<sup>12</sup>) that is located at some site of
post translationally modified amino acid
residue.</p>
      <p><sup>8</sup> http://purl.obolibrary.org/obo/PR_000000001</p>
      <p><sup>9</sup> http://purl.obolibrary.org/obo/PR_000018263</p>
      <p><sup>10</sup> http://purl.obolibrary.org/obo/CHEBI_33708</p>
      <p><sup>11</sup> http://pir.georgetown.edu/cgi-bin/resid?id=AA0012</p>
    </sec>
    <sec id="sec-29">
      <p>These facts are currently given in PSI-MOD as part
of text definitions (e.g. “A protein modification that
effectively converts an L-lysine residue to
N6-methyl-L-lysine”<sup>13</sup>) and as unstructured xref_definitions:
RESID:AA0076, the modification’s output, is listed among five other xref_definitions in the PSI-MOD entry for
MOD:00085.</p>
    </sec>
    <sec id="sec-30">
      <p>Any instance of site of an amino acid
residue in a protein is occupied_by some
amino acid residue.</p>
      <p>Any instance of site of post translationally
modified amino acid residue is occupied_by
some modified residue.</p>
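      <p>The relationship between a modification process, its input and output residues, and the class of the site involved can be sketched as follows. This is a minimal Python illustration using the PSI-MOD and RESID identifiers quoted in the text; the class and function names are hypothetical, and the structure is a simplification of the OWL axioms rather than a copy of them.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modification:
    """A modification process with the residue it consumes and produces."""
    mod_id: str
    input_residue: str
    output_residue: str


# Identifiers as quoted in the text: L-lysine is RESID:AA0012,
# N6-methyl-L-lysine is RESID:AA0076, and N6,N6,N6-trimethyl-L-lysine
# is RESID:AA0074.
MODIFICATIONS = {
    "MOD:00085": Modification("MOD:00085", "RESID:AA0012", "RESID:AA0076"),
    "MOD:00083": Modification("MOD:00083", "RESID:AA0012", "RESID:AA0074"),
}

# A modified residue is one that is the output of some modification process.
MODIFIED_RESIDUES = {m.output_residue for m in MODIFICATIONS.values()}


def apply_modification(occupant_resid, mod_id):
    """Return the residue produced when the process acts on the occupant."""
    mod = MODIFICATIONS[mod_id]
    if occupant_resid != mod.input_residue:
        raise ValueError(f"{mod_id} does not act on {occupant_resid}")
    return mod.output_residue


def site_class_for(occupant_resid):
    """Pick the most specific site class for a site with this occupant."""
    if occupant_resid in MODIFIED_RESIDUES:
        return "site of post translationally modified amino acid residue"
    return "site of an amino acid residue in a protein"


methylated = apply_modification("RESID:AA0012", "MOD:00085")
print(methylated, "->", site_class_for(methylated))
```
]]></preformat>
      <p>Keeping the process (MOD) and the resulting residue (RESID) as separate entities is what allows the site class to be determined from its occupant rather than from a text definition.</p>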
      <sec id="sec-30-1">
        <title>C. Modified Proteins</title>
        <p>histone H3.3 acetylated and methylated
2 (mouse) is that particular modified protein,
which is a subclass of histone (mouse) that
has four sites of modified residue. One of
those sites is a mouse histone
H3.3 Lys-10 site that is occupied_by some
N6,N6,N6-trimethyl-L-lysine residue<sup>14</sup>.</p>
      </sec>
    </sec>
    <sec id="sec-31">
      <p>Fig. 5. Definition of Histone H3.3 acetylated and methylated 2 (mouse)</p>
      <p><sup>12</sup> http://pir.georgetown.edu/cgi-bin/resid?id=AA0076</p>
      <p><sup>13</sup> http://www.ebi.ac.uk/ontology-lookup/?termId=MOD:00085</p>
      <p><sup>14</sup> http://pir.georgetown.edu/cgi-bin/resid?id=AA0074</p>
      <sec id="sec-31-1">
        <title>D. Position within a Reference Sequence</title>
      </sec>
    </sec>
    <sec id="sec-32">
      <p>The position within a reference sequence of a particular
amino acid residue or modification site is a key piece of
information currently used to identify such sites in many
different resources. We represent such descriptions as
Information Artifact Ontology (IAO)<sup>15</sup> information content
entities. For instance, the UniProt reference sequence
P84244<sup>16</sup> corresponds to histone H3.3 (mouse). We
use the class site location within reference
sequence of uniprot P84244 for positions relative
to that sequence. Members of that class are named
individuals like position 19 in reference sequence
of mouse histone H3.3, which is about a specific
mouse histone H3.3 site, namely mouse histone
H3.3 Lys-19 site.</p>
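      <p>The separation between a site and the information about its position can be sketched in Python as two record types linked by an is-about relation. The type and function names below are hypothetical, and the sketch treats classes and individuals loosely for the sake of brevity.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """A site on a protein, identified here only by its label."""
    label: str


@dataclass(frozen=True)
class SitePosition:
    """An information content entity: a position in a reference sequence
    that is about one specific site."""
    reference_accession: str
    position: int
    is_about: Site


LYS_19_SITE = Site("mouse histone H3.3 Lys-19 site")

# "position 19 in reference sequence of mouse histone H3.3"
POSITIONS = [
    SitePosition("P84244", 19, LYS_19_SITE),
]


def site_for(positions, reference_accession, position):
    """Return the site that a matching position record is about, or None."""
    for record in positions:
        if (record.reference_accession, record.position) == (reference_accession, position):
            return record.is_about
    return None


print(site_for(POSITIONS, "P84244", 19).label)
```
]]></preformat>
      <p>Because the number lives in the position record and not in the site, a second record using a different numbering convention or reference sequence could be about the same site without any change to the site itself.</p>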
    </sec>
    <sec id="sec-33">
      <title>CONCLUSIONS AND FUTURE WORK</title>
    </sec>
    <sec id="sec-34">
      <p>We have outlined and implemented a representation of
posttranslational modifications that explicitly accounts for the
biological entities involved and information about them. This
includes representations of sites, residues, and modified
variants as well as of information artifacts such as descriptions of
sites relative to reference sequences.</p>
      <p>This representation scheme is ready to be applied to a
larger set of test data, some of which is not currently present
in PRO and some of which is currently present in a less
structured form. Immediate future work involves translating
histone modification data from the Histone Infobase and from
top-down proteomics, integrating those data with the relevant
records in PRO, and creating new records where needed. Much
of the work can be scripted, but some will require manual
curation as well.</p>
      <p>Other planned future work is to expand the scope of our
representation of sites to include other types of modifications,
mutation sites, cleavage sites, and so on. Though the basic
scheme will remain the same – using terms that represent
each type of site along with terms for related entities – each
will present opportunities for interesting ontology work. For
instance, mutations that involve the deletion of an amino acid
residue raise non-trivial questions about what happens to the
site when the residue that occupied it is gone.</p>
    </sec>
    <sec id="sec-35">
      <title>V. EXAMPLE QUERY</title>
      <p>The following simple SPARQL query gets the subclasses
of histone H3.3 (mouse) (h33mouse) and their
sites that are subclasses of site of an amino acid
residue in a protein (ressite)<sup>17</sup>. That is, it
answers the question: what are the variants of histone H3.3
(mouse) and the sites of their PTMs?</p>
      <p><sup>15</sup> https://code.google.com/p/information-artifact-ontology/</p>
      <p><sup>16</sup> http://www.uniprot.org/uniprot/P84244</p>
      <p><sup>17</sup> The URI used here for site of an amino acid residue in a
protein (http://purl.obolibrary.org/obo/PROXXX_0001001) is a temporary
placeholder to be replaced by a real PRO ID when it is added to a PRO release.</p>
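      <p>The question posed in this section can also be illustrated without a triple store. The Python sketch below evaluates it over a small hand-built set of triples. It is not the SPARQL query itself: the has_site predicate is a hypothetical flattening of the OWL restrictions, the Lys-28 site label is formed by analogy with the labels in the text, and the data are assumptions drawn from the examples above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
SUBCLASS_OF = "rdfs:subClassOf"
HAS_SITE = "has_site"  # hypothetical predicate standing in for OWL restrictions

H33_MOUSE = "histone H3.3 (mouse)"
RES_SITE = "site of an amino acid residue in a protein"
VARIANT = "histone H3.3 acetylated and methylated 2 (mouse)"
H33_SITE = "mouse histone H3.3 site"

# Assumed example data, following the modified lysines discussed in the text.
TRIPLES = {
    (VARIANT, SUBCLASS_OF, H33_MOUSE),
    (H33_SITE, SUBCLASS_OF, RES_SITE),
    ("mouse histone H3.3 Lys-10 site", SUBCLASS_OF, H33_SITE),
    ("mouse histone H3.3 Lys-19 site", SUBCLASS_OF, H33_SITE),
    ("mouse histone H3.3 Lys-28 site", SUBCLASS_OF, H33_SITE),
    (VARIANT, HAS_SITE, "mouse histone H3.3 Lys-10 site"),
    (VARIANT, HAS_SITE, "mouse histone H3.3 Lys-19 site"),
    (VARIANT, HAS_SITE, "mouse histone H3.3 Lys-28 site"),
}


def superclasses(cls):
    """Return all direct and indirect superclasses of a class."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in TRIPLES:
            if subject == current and predicate == SUBCLASS_OF and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def variants_and_sites():
    """Pairs (variant, site): variant is a subclass of histone H3.3 (mouse)
    and site is a subclass of the residue-site class."""
    rows = []
    for subject, predicate, obj in TRIPLES:
        if (predicate == HAS_SITE
                and H33_MOUSE in superclasses(subject)
                and RES_SITE in superclasses(obj)):
            rows.append((subject, obj))
    return sorted(rows)


for variant, site in variants_and_sites():
    print(variant, "|", site)
```
]]></preformat>
      <p>In SPARQL, the two membership tests correspond to subclass patterns on the variant and site variables, and the transitive closure computed by the helper function is what a reasoner or a property path would supply.</p>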
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>