<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting missing annotations in Gene Ontology with Knowledge Graph Embeddings and True Path Rule</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Özge Erten</string-name>
          <email>o.erten@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shervin Mehryar</string-name>
          <email>shervin.mehryar@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remzi Çelebi</string-name>
          <email>remzi.celebi@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Brewster</string-name>
          <email>christopher.brewster@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Group, TNO</institution>
          ,
          <addr-line>Kampweg, Soesterberg</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Data Science, Maastricht University</institution>
          ,
          <addr-line>Paul-Henri Spaaklaan 1, 6229 GT, Maastricht</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SWAT4HCLS 2023: The 14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Gene Ontology (GO) and its Annotations (GOA) provide a controlled and evolving vocabulary for gene products and gene functions that is widely used in molecular biology. GO &amp; GOA are updated and maintained both automatically from biological publications and manually by curators. These knowledge bases, however, are often incomplete for two reasons: 1) research in the biological domain itself is still ongoing; 2) the amount of experimental evidence might not yet be sufficient to validate annotations. In this paper, we address the gap in evidence between gene products and their annotations by making link predictions using Knowledge Graph Embedding (KGE) methods. Through the application of the True Path Rule (TPR) in the training stage of KGE, we were able to improve the performance of traditional KGE methods. We report two experimental scenarios with the GO and GO Chicken Annotation datasets to show the contribution of embedding the TPR to prediction accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Link prediction</kwd>
        <kwd>True path rule</kwd>
        <kwd>Knowledge graph embeddings</kwd>
        <kwd>Predicting Gene Ontology Annotations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>for each GO class in the graph. The second step is that if the classifier assigns a positive label to a
class, the parent classes also receive that label, but negative labels do not propagate from the bottom
up. The third step is that if a class receives a negative label, all of its child
classes are also assigned negative labels; positive labels do not affect the lower classes in the GO hierarchy. In the
experiments, the TPR-based ensemble performs better than other ensemble algorithms.</p>
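      <p>The propagation steps above can be sketched in Python. This is a minimal illustration with hypothetical GO class names and a plain dict-of-parents graph, not the ensemble implementation itself:</p>

```python
# Minimal sketch of the TPR propagation steps described above, using
# hypothetical GO class names (an illustration, not the ensemble of [5]).

def tpr_propagate(labels, parents):
    """Make per-class labels TPR-consistent: positive labels propagate
    upward to ancestors, negative labels propagate downward to children."""
    # Invert the parent map to obtain the child map.
    children = {}
    for node, node_parents in parents.items():
        for p in node_parents:
            children.setdefault(p, []).append(node)

    fixed = dict(labels)

    def mark_ancestors(node):
        for p in parents.get(node, []):
            if fixed.get(p) is not True:
                fixed[p] = True        # positives propagate upward
                mark_ancestors(p)

    def mark_descendants(node):
        for c in children.get(node, []):
            if fixed.get(c) is not False:
                fixed[c] = False       # negatives propagate downward
                mark_descendants(c)

    for node, label in labels.items():
        if label is True:
            mark_ancestors(node)
        elif label is False:
            mark_descendants(node)
    return fixed
```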
      <p>
        Kulmanov et al.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] describe how ontologies can be used to provide background knowledge in
machine learning-based semantic similarity tasks. In representation learning, the similarity between
elements, whether in terms of distance or of belonging to a particular subject, plays an important role in
model training. To observe this similarity contribution, they evaluate various ontology embedding
techniques. One of the experiments added GO semantics with the TPR to two neural
network-based methods, and both experimental results show an increase in prediction scores.
      </p>
      <p>In this work, we defined two experimental designs with a dataset that consists of GOA versions
from 2018 to current (2022). The GOA versions are considered and treated as pairs. For the training
set, both experiments use the earlier version in the pair as well as the subsumption classes in GO.
The newer version is compared with the prior version to detect newly
added annotations. In the first scenario, the testing and validation sets take into account only
those newly added annotations. The second scenario additionally adds to the test and validation sets the
implicit annotations entailed in the GO hierarchy by the TPR.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        KGEs are the vector embeddings learned from a set of triples describing facts in a KG. KGEs can
subsequently be used to perform reasoning tasks such as link prediction and entity classification.
Typically, KG embedding methods embed entities and relations directly into a vector space, where
each triple (head entity, relation, tail entity) in the KG is assigned a score based on its validity. The
sum of scores (i.e., the loss) over the positive and negative triple sets is optimized during training. In this paper,
we applied KGE methods to GO and its annotations to predict missing or future annotations.
To further capture and embed the TPR, we generate samples using the TPR and incorporate them into the
training data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>In detail, we formulate the task as follows. Given a KG 𝒢 consisting of triples (e₁, r, e₂), where r is a relation
between entities e₁ and e₂, the optimal vector embeddings are first learned for all entities and relations. In the
corresponding vector space, a valid triple (e₁, r, e₂) ∈ 𝒢 satisfies the relation e⃗₂ = e⃗₁ + r⃗. Given a set 𝒮 of triples
representing facts in the KG, the following criterion is then used to learn the embeddings for link prediction:
ℒ_tri = ∑_{(h,r,t)∈𝒮} ‖h⃗ + r⃗ − t⃗‖₂,
(1)</p>
      <p>
        where h⃗, r⃗, and t⃗ are the vector representations in ℝᵈ corresponding to the head entity, relation,
and tail entity. These representations are learned from the triples in the data and embedded as
d-dimensional vectors, following the procedure of TransE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our contribution is adding samples following the TPR, as shown in Figure 1, to the set 𝒮. Essentially, we
distinguish between direct gene product–function relations and higher-level ones. In the first
scenario, embeddings are learned using the TransE model on the existing triples in the dataset. In the
second scenario, we enrich the training data with additional samples from gene products inheriting
their first-level as well as second-level ancestral functions. These additional samples
serve to improve embedding quality, and we refer to this method as TransE+TPR. Details on
dataset creation and the different scenarios are given in the next section, and the code is available
on our GitHub: https://github.com/ozyygen/predict-KGE-TPR.
      </p>
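      <p>The scoring criterion of Eq. (1) can be sketched numerically. In this minimal illustration, embeddings are plain Python lists and the entity and relation names are hypothetical:</p>

```python
# Minimal numeric sketch of the criterion in Eq. (1): the score of a triple
# (h, r, t) is the L2 distance ||h + r - t||, summed over the fact set S.
# Embeddings are plain Python lists here purely for illustration.
import math

def triple_score(h_vec, r_vec, t_vec):
    """L2 distance between the translated head (h + r) and the tail t."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(h_vec, r_vec, t_vec)))

def total_loss(triples, entity_emb, relation_emb):
    """Sum of triple scores over a set of (head, relation, tail) triples."""
    return sum(triple_score(entity_emb[h], relation_emb[r], entity_emb[t])
               for h, r, t in triples)
```

<p>A valid triple whose embeddings satisfy e⃗₂ = e⃗₁ + r⃗ exactly would receive a score of zero under this criterion.</p>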
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>We generate four datasets using GOA versions from 2018 to the current one (2022). Each dataset contains
a pair of versions selected with a one-year window. We use the prior version to generate the
training set, and the latter to generate the testing and validation sets. Each set contains
triples consisting of a gene product as head, a gene function as tail, and the annotation type linking the gene
function to that gene product as relation. Additionally, we add the "is a" and "part of" semantics of
gene functions and TPR-inferred annotations from GO to the training set. Table 1 gives the triple counts
of the training, testing, and validation sets for each dataset.</p>
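      <p>The version-pair construction can be sketched as follows. This is a simplified illustration under assumed triple contents (the function name and relation label are hypothetical):</p>

```python
# Sketch of the version-pair split described above: the prior GOA release
# forms the training set, while annotations appearing only in the newer
# release form the test/validation pool. Triple contents are hypothetical.

def split_version_pair(prior, newer):
    """prior, newer: sets of (gene_product, annotation_type, go_function)
    triples. Returns (train, pool): the training set and the newly added
    annotations whose entities already occur in the prior version."""
    train = set(prior)
    known_products = {h for h, _, _ in prior}
    known_functions = {t for _, _, t in prior}
    # Keep only genuinely new annotations between entities seen in training.
    pool = {(h, r, t) for h, r, t in newer - prior
            if h in known_products and t in known_functions}
    return train, pool
```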
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <p>
        In this work, two experimental designs were studied for KGE methods across four versions and
three scenarios. Figure 1 shows the training, testing, and validation split on toy data. Nodes 8, 9, 10, and 11,
representing gene functions, annotate the gene products X1, X2, X3, and X4, respectively, in the figure.
Solid red lines denote the first-version relations between gene products and gene functions, and
dashed red lines represent the relations of the second version of the dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Scenario 1: Two consecutive versions of GOA were used to generate the dataset for training the
embedding model. Specifically, the prior version was used to generate the training set. We also
added the related GO subsumption classes and TPR-inferred annotations in the training set in order
to enrich semantic information in the KG. The test and validation sets were created with the latter
version of the pair by randomly splitting the triples (annotations) into a test set and a validation set
by a ratio of 0.5. We excluded triples containing a new gene product or gene function
not present in the prior version. This scenario is denoted by sc-1.</p>
      <p>Scenario 2: In Scenario 2, we used the same training set as Scenario 1, but extended the Scenario
1 test set with the implicit relations obtained from the TPR. We added relations that can be inferred
by the TPR to the test set. We infer these relations by applying the following rule: if a gene product
is annotated by a gene function in the training set, then the gene product must also be annotated by
the ancestral classes of that gene function. The objective of this addition is to observe whether our
method can predict the implicit links inferred by the TPR.</p>
      <p>To observe the effect of the TPR at different ancestry depths, we designed Scenario 2.1 and Scenario
2.2 with two different superclass depths:
• Scenario 2.1: In this scenario, we generated implicit annotations with the TPR using the first-level
ancestors of gene functions. This scenario is denoted by sc-2.1.
• Scenario 2.2: For this scenario, we considered second-level ancestors of gene functions in
addition to the first-level ancestors. This scenario is denoted by sc-2.2.</p>
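      <p>The depth-1 and depth-2 inference used in sc-2.1 and sc-2.2 can be sketched as follows. The graph structure and names are illustrative assumptions, not the authors' implementation:</p>

```python
# Sketch of how TPR-inferred triples for sc-2.1 and sc-2.2 could be produced:
# depth 1 annotates direct parents of each annotated GO function, and
# depth 2 also annotates grandparents. Names and structures are hypothetical.

def tpr_infer(triples, parents, depth):
    """Infer (gene_product, relation, ancestor) triples up to the given
    ancestry depth via the True Path Rule."""
    inferred = set()
    for h, r, t in triples:
        frontier = {t}
        for _ in range(depth):
            # Move one level up the GO hierarchy, recording new annotations.
            frontier = {p for n in frontier for p in parents.get(n, [])}
            inferred |= {(h, r, p) for p in frontier}
    return inferred
```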
    </sec>
    <sec id="sec-5">
      <title>5. Results and Conclusion</title>
      <p>We conducted experiments with TransE and TransE+TPR to assess the efficacy of the TPR for link
prediction accuracy. The results are shown in Table 2.</p>
      <p>We repeated the experiments with different time windows and scenarios. The four datasets help us
observe whether the triple counts of the train, test, and validation sets have an effect on the predictions. Scenario
1 does not have any inferred annotations. For Scenario 2, we added TPR-inferred GO semantics to
compare the scores with Scenario 1. The table reports both methods on the four datasets across the three scenarios.
Accordingly, the TransE Hits@10 scores show that adding hierarchical semantics to the dataset does not have a
significant impact on prediction accuracy. In contrast, for the TransE+TPR method, the
rule contributes to increasing the accuracy.</p>
      <p>Specifically, we implement TransE and TPR and evaluate them on different GOA datasets. The dataset is
enriched with TPR-inferred annotations and GO subsumption classes. The results show a significant
increase in accuracy when the rules are applied during the training process. In particular, in the best-case
scenario the proposed method achieves an average Hits@10 of 0.6275 ± 0.0814,
compared to an average Hits@10 of 0.1037 ± 0.0094 for TransE. This gain of approximately 0.52
in performance is attributed to the importance of the hierarchical information captured by the model
through the TPR samples, as explained in the previous section.</p>
      <p>Even though the TransE+TPR method achieved the highest accuracy scores for Scenario 2 on almost every
dataset, further study is required to determine the optimal depth of hierarchical GO class addition
that yields the best prediction accuracy. Also, distinguishing annotations based on evidence,
such as experimental type or automatic generation, and treating them accordingly might have
an impact on prediction accuracy. Furthermore, we expect that training a KGE with several versions
of the data will enhance its effectiveness in link prediction. Lastly, a training-test split
that takes into account gene traits such as orthology can be used to test link prediction stability. We
leave these topics open for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <article-title>Ten quick tips for using the gene ontology</article-title>
          ,
          <source>PLoS Computational Biology</source>
          <volume>9</volume>
          (
          <year>2013</year>
          )
          <fpage>e1003343</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. O.</given-names>
            <surname>Consortium</surname>
          </string-name>
          ,
          <article-title>The gene ontology resource: 20 years and still going strong</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>47</volume>
          (
          <year>2019</year>
          )
          <fpage>D330</fpage>
          -
          <lpage>D338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graph embeddings for link prediction</article-title>
          ,
          <source>Symmetry</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>485</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A literature review of gene function prediction by modeling gene ontology</article-title>
          ,
          <source>Frontiers in Genetics</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>400</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Valentini</surname>
          </string-name>
          ,
          <article-title>True path rule hierarchical ensembles</article-title>
          ,
          <source>in: International Workshop on Multiple Classifier Systems</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Smaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoehndorf</surname>
          </string-name>
          ,
          <article-title>Semantic similarity and machine learning with ontologies</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>bbaa199</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graph embedding: Approaches, applications and benchmarks</article-title>
          ,
          <source>Electronics</source>
          <volume>9</volume>
          (
          <year>2020</year>
          )
          <fpage>750</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>