<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Protein Annotation Framework Empowered with Semantic Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jemma X. Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edmond J. Breen</string-name>
          <email>ebreen@proteome.org.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaomin Song</string-name>
          <email>xsong@proteome.org.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brett Cooke</string-name>
          <email>bcooke@proteome.org.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark P. Molloy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Australian Proteome Analysis Facility, Macquarie University</institution>
          ,
          <addr-line>Sydney</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an association discovery framework for proteins based on semantic annotations from the biomedical literature. An automatic ontology-based annotation method is used to create a semantic protein annotation knowledge base. A semantic reasoning service performs realisation reasoning on the original annotations to infer more accurate associations. A case study on protein-disease association discovery on a real-world colorectal cancer dataset is presented.</p>
      </abstract>
      <kwd-group>
        <kwd>Protein annotation</kwd>
        <kwd>bioinformatics</kwd>
        <kwd>semantic reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        To bridge the gap between biomedical science and bioinformatics, many
biomedical ontologies have been created in the past few years. Ontology-based
semantic annotation of biomedical entities is of interest to both biomedical
researchers and the general public. Meanwhile, the biomedical domain has a large
and fast-growing body of literature, for which MedLine<sup>1</sup> is the
primary publication repository. Ontology-based biomedical text annotation has
shown promising progress, and several tools have been successfully developed
and evaluated on biomedical text mining problems [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">2, 5, 4</xref>
        ].
However, these generic text-based biomedical annotation tools only provide
concept-level annotations. The ability to perform protein-oriented semantic
annotation would greatly benefit proteomics research by enabling easy discovery
of protein associations. Moreover, traditional text-based annotation tools tend
to create excessive annotations, and some tools further expand the raw
annotations by using semantic reasoning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Inferring the most informative and accurate annotations would be
very valuable for efficient and accurate association discovery.
      </p>
      <p>This paper proposes an integrated, high-performance framework that
leverages protein annotations and semantic reasoning to build an informative
Knowledge Base (KB) of associations between proteins and biomedical concepts.
Starting from a list of proteins, the system automatically retrieves a pool of
MedLine citations and annotates the proteins using pre-defined biomedical
ontologies. A realisation reasoning service is then applied to infer more
accurate protein association information. In our preliminary study, the focus
is on the discovery of potential protein-disease associations. A case study on
discovering protein-disease associations for a real-world colorectal cancer
tissue protein dataset is presented.</p>
      <sec id="sec-1-1">
        <title>Footnote 1: http://www.ncbi.nlm.nih.gov/pubmed</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>SPRAM: A Semantic PRotein Annotation framework based on MedLine</title>
      <p>We propose a Semantic PRotein Annotation method based on MedLine (SPRAM),
which produces semantically inferred protein annotations based on the biomedical
literature and ontologies (Fig. 1). SPRAM starts with a list of proteins of
in</p>
    </sec>
    <sec id="sec-3">
      <title>Semantic reasoning for protein annotations</title>
      <p>
        In biomedical ontology annotation, an instance is very often annotated with
multiple classes that are related by subclass relationships in the ontology. To
the best of our knowledge, existing biomedical annotation tools with semantic
reasoning functionality only perform semantic expansion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There has been no prior work on
drilling down the annotations to the most specific concepts by using semantic
reasoning, even though in many cases the most specific classes of a protein
represent its biomedical categorical information more accurately. For example,
traditional protein Gene Ontology analysis, which shows the distribution of
biological processes or molecular functions, is nearly always biased towards the
top-level classes in the ontology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Footnotes: 2. http://www.uniprot.org/; 3. http://www.ncbi.nlm.nih.gov/gene; 4. http://disease-ontology.org/; 5. http://www.bioontology.org/wiki/index.php/Annotator Web service</title>
        <p>We developed a specialised realisation reasoning service for dynamically
generated annotations. Unlike traditional Description Logic most-specific-concept
reasoning, our algorithm works on the fly on a dynamic set of annotations rather
than on assertions in a static KB. Only the most specific annotations are
stored in the KB. The algorithm takes a set of semantic protein annotations, T,
and the ontology, O, that T is based on. A most specific class set, T′, is
initialised to the empty set, ∅. For each class t ∈ T of each protein, the
algorithm finds all subclasses of t in O other than t itself, i.e.,
{C<sub>i</sub>} where C<sub>i</sub> ⊑ t holds in O. Class t is added to T′ if
{C<sub>i</sub>} ∩ T = ∅, i.e., t is a most specific annotation with respect to
O. The algorithm outputs the most specific class set T′, which is inserted into
the nascent KB.</p>
        <p>Fig. 2 shows an example of the effect of applying realisation reasoning to the
disease annotations of the protein with UniProt ID "O43175". The classes "disease",
"cancer" and "carcinoma" in the original annotation set are all realised to
"adenocarcinoma", because this last concept is subsumed by the other three and
is itself in the original annotation set. This is important because the result
represents the biomedical categorical information more accurately and reduces
complexity.</p>
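        <p>The procedure described above can be illustrated with a minimal Python sketch. It is not the SPRAM implementation: the function names, the dictionary-based ontology representation and the linear chain disease, cancer, carcinoma, adenocarcinoma are assumptions made for illustration only, and the ontology is assumed to be acyclic.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Iterable, Set


def proper_subclasses(cls: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Return every transitive subclass of cls, excluding cls itself."""
    found: Set[str] = set()
    stack = list(children.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def most_specific(annotations: Iterable[str],
                  children: Dict[str, Set[str]]) -> Set[str]:
    """Keep a class only if none of its proper subclasses is also annotated."""
    annotated = set(annotations)
    return {t for t in annotated
            if not (proper_subclasses(t, children) & annotated)}


# Hypothetical ontology fragment: parent class -> direct subclasses.
CHILDREN = {
    "disease": {"cancer"},
    "cancer": {"carcinoma"},
    "carcinoma": {"adenocarcinoma"},
}

original = {"disease", "cancer", "carcinoma", "adenocarcinoma"}
print(most_specific(original, CHILDREN))  # {'adenocarcinoma'}
```
]]></preformat>
        <p>In this sketch a class survives only when no class below it in the hierarchy appears in the same annotation set, so several unrelated most specific classes can be kept for one protein.</p>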
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Case study: discovery of protein-disease associations for colorectal cancer tissues</title>
      <p>
        Jankova et al. found 45 up-regulated proteins in colorectal cancer tissues
using experimental protein iTRAQ analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Biologists would like to know which
diseases are related to these proteins and whether their associations with
colorectal cancer have been reported before.
      </p>
      <p>To help biologists achieve these goals, we took these 45 proteins and used the
SPRAM workflow to help discover the diseases potentially associated with
them. The workflow retrieved 1080 MedLine citations and produced 354 disease
associations based on HDO, which were reduced to 241 unique associations after
realisation reasoning. SPRAM returns a set of protein-disease associations to
the biologists, together with the titles and URLs of the source references. The
biologists can then use this result to validate the associations easily by
tracing back to the references.</p>
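      <p>As a hedged illustration of how such traceable results might be represented, the following Python sketch groups supporting citations by protein-disease pair. The record fields, the function name, the URL pattern and all sample values are hypothetical and are not taken from SPRAM.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Association:
    protein: str   # protein identifier, e.g. a UniProt accession
    disease: str   # disease ontology class label
    pmid: str      # identifier of the supporting MedLine citation
    title: str     # title of the supporting citation

    @property
    def url(self) -> str:
        # Assumed URL pattern built on the PubMed address given in footnote 1.
        return "http://www.ncbi.nlm.nih.gov/pubmed/" + self.pmid


def group_by_pair(records: Iterable[Association]
                  ) -> Dict[Tuple[str, str], List[Association]]:
    """Collect the supporting citations of each unique protein-disease pair."""
    grouped: Dict[Tuple[str, str], List[Association]] = defaultdict(list)
    for record in records:
        grouped[(record.protein, record.disease)].append(record)
    return dict(grouped)


# Placeholder records; identifiers and titles are made up.
records = [
    Association("P1", "adenocarcinoma", "0000001", "Example title A"),
    Association("P1", "adenocarcinoma", "0000002", "Example title B"),
    Association("P2", "colorectal cancer", "0000003", "Example title C"),
]
for (protein, disease), refs in group_by_pair(records).items():
    print(protein, disease, [r.url for r in refs])
```
]]></preformat>
      <p>Keeping the citation identifier on every record is what allows each association to be traced back to its source reference.</p>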
      <p>Fig. 3 shows how the distribution of the diseases associated with
these 45 proteins changes after realisation reasoning. The distribution after
realisation reasoning is more accurate and more sensible. For
example, the most frequent concept, disease, was removed and the next, cancer,
was greatly reduced, thereby producing a clearer set of associations.</p>
      <p>Fig. 3 panels: (a) DOA distribution before reasoning; (b) DOA distribution after reasoning.</p>
      <p>To find proteins reported as related to colorectal cancer, the biologist issues
a query using the concept "colorectal cancer". The semantic reasoning service
rewrites this query into a union of this concept and all of its subclasses. The
result shows that 6 proteins (CEA, NNE, HSP 84, NPM, 3-PGDH and UEV-1) are
reported in the literature as being related to colorectal cancer.</p>
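      <p>The query rewriting step can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPRAM service: the function names, the subclass hierarchy and the small knowledge base with proteins P1 to P3 are all hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Set


def expand_query(concept: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Return the concept together with all of its transitive subclasses."""
    result = {concept}
    stack = [concept]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def proteins_for(concept: str,
                 children: Dict[str, Set[str]],
                 kb: Dict[str, Set[str]]) -> List[str]:
    """List proteins annotated with the concept or any of its subclasses."""
    targets = expand_query(concept, children)
    return sorted(p for p, classes in kb.items() if classes & targets)


# Hypothetical hierarchy: parent class -> direct subclasses.
CHILDREN = {
    "colorectal cancer": {"colon cancer", "rectum cancer"},
    "colon cancer": {"colon adenocarcinoma"},
}

# Hypothetical KB holding only most specific annotations per protein.
KB = {
    "P1": {"colon adenocarcinoma"},
    "P2": {"breast cancer"},
    "P3": {"colorectal cancer"},
}

print(proteins_for("colorectal cancer", CHILDREN, KB))  # ['P1', 'P3']
```
]]></preformat>
      <p>Because only the most specific annotations are stored after realisation reasoning, a protein annotated with a subclass would be missed by an exact match on the queried concept, which is why the query is expanded to the subclasses.</p>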
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper proposes an automatic protein-oriented association discovery
framework based on semantic annotations from literature. A semantic reasoning
service provides realisation reasoning. We demonstrate the usage of our system on
protein-disease association discovery using a real-world colorectal cancer protein
dataset. In upcoming work, focus will be given to a ranking model of protein
associations and customisable selection of protein-PMID mappings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>L.</given-names>
            <surname>Jankova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dent</surname>
          </string-name>
          , E. Bokey,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chapuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baker</surname>
          </string-name>
          , G. Robertson,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.P.</given-names>
            <surname>Molloy</surname>
          </string-name>
          .
          <article-title>Proteomic comparison of colorectal tumours and non-neoplastic mucosa from paired patient samples using iTRAQ mass spectrometry</article-title>
          .
          <source>Mol Biosyst</source>
          ,
          <volume>7</volume>
          (
          <issue>11</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Jelier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuemie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veldhoven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dorssers</surname>
          </string-name>
          , G. Jenster, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kors</surname>
          </string-name>
          .
          <article-title>Anni 2.0: a multipurpose text-mining tool for the life sciences</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>9</volume>
          (
          <issue>6</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Clement</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nigam H.</given-names>
            <surname>Shah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark A.</given-names>
            <surname>Musen</surname>
          </string-name>
          .
          <article-title>The open biomedical annotator</article-title>
          .
          <source>In AMIA-TBI'09</source>
          , pages
          <fpage>56</fpage>–<lpage>60</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>H.</given-names>
            <surname>López-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reboiro-Jato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Glez-Peña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Aparicio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gachet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buenaga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Fdez-Riverola</surname>
          </string-name>
          .
          <article-title>Bioannote: A software platform for annotating biomedical documents with application in medical learning environments</article-title>
          .
          <source>Computer Methods and Programs in Biomedicine</source>
          ,
          <volume>111</volume>
          (
          <issue>1</issue>
          ):
          <fpage>139</fpage>–<lpage>147</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Mariana</given-names>
            <surname>Neves</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ulf</given-names>
            <surname>Leser</surname>
          </string-name>
          .
          <article-title>A survey on annotation tools for the biomedical literature</article-title>
          .
          <source>Brief Bioinform</source>
          ,
          18 December <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>