<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting missing annotations in Gene Ontology with Knowledge Graph Embeddings and True Path Rule</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Özge Erten</string-name>
          <email>o.erten@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shervin Mehryar</string-name>
          <email>shervin.mehryar@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remzi Çelebi</string-name>
          <email>remzi.celebi@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Brewster</string-name>
          <email>christopher.brewster@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Group, TNO</institution>
          ,
          <addr-line>Kampweg, Soesterberg</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Data Science, Maastricht University</institution>
          ,
          <addr-line>Paul-Henri Spaaklaan 1, 6229 GT, Maastricht</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SWAT4HCLS 2023: The 14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Gene Ontology (GO) and its Annotations (GOA) provide a controlled and evolving vocabulary for gene products and gene functions that is widely used in molecular biology. GO &amp; GOA are updated and maintained both automatically from biological publications and manually by curators. These knowledge bases, however, are often incomplete for two reasons: 1) research in the biological domain itself is still ongoing; 2) the amount of experimental evidence might not yet be sufficient to validate annotations. In this paper, we address the gap in evidence between gene products and their annotations by making link predictions using Knowledge Graph Embedding (KGE) methods. Through the application of the True Path Rule (TPR) in the training stage of KGE, we were able to improve the performance of traditional KGE methods. We report two experimental scenarios with the GO and GO Chicken Annotation datasets to show the contribution of embedding the TPR to prediction accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Link prediction</kwd>
        <kwd>True path rule</kwd>
        <kwd>Knowledge graph embeddings</kwd>
        <kwd>Predicting Gene Ontology Annotations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>for each GO class in the graph. The second step is that if the classifier assigns a positive label to a
class, the parent classes also receive that label, but negative labels do not propagate from the bottom
up. The third step is that if a class receives a negative label, all of its child
classes are also assigned negative labels; positive labels do not affect the lower classes in the GO hierarchy. In the
experiments, the TPR-based ensemble performs better than other ensemble algorithms.</p>
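      <p>The propagation steps above can be sketched in Python. This is a minimal illustration with hypothetical GO class names and a plain dict-of-parents graph, not the ensemble implementation itself:</p>

```python
# Minimal sketch of the TPR propagation steps described above, using
# hypothetical GO class names (an illustration, not the ensemble of [5]).

def tpr_propagate(labels, parents):
    """Make per-class labels TPR-consistent: positive labels propagate
    upward to ancestors, negative labels propagate downward to children."""
    # Invert the parent map to obtain the child map.
    children = {}
    for node, node_parents in parents.items():
        for p in node_parents:
            children.setdefault(p, []).append(node)

    fixed = dict(labels)

    def mark_ancestors(node):
        for p in parents.get(node, []):
            if fixed.get(p) is not True:
                fixed[p] = True        # positives propagate upward
                mark_ancestors(p)

    def mark_descendants(node):
        for c in children.get(node, []):
            if fixed.get(c) is not False:
                fixed[c] = False       # negatives propagate downward
                mark_descendants(c)

    for node, label in labels.items():
        if label is True:
            mark_ancestors(node)
        elif label is False:
            mark_descendants(node)
    return fixed
```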
      <p>
        Kulmanov et al.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] describe how ontologies can be used to provide background knowledge in
machine learning-based semantic similarity tasks. In representation learning, the similarity between
elements, whether in terms of distance or of belonging to a particular subject, plays an important role in
model training. To observe this similarity contribution, they evaluate various ontology embedding
techniques. One of the experiments added GO semantics with the TPR to two neural
network-based methods, and both experimental results show an increase in prediction scores.
      </p>
      <p>In this work, we defined two experimental designs with a dataset that consists of GOA versions
from 2018 to current (2022). The GOA versions are considered and treated as pairs. For the training
set, both experiments use the earlier version in the pair as well as the subsumption classes in GO.
The newer version is compared with the prior version to detect newly
added annotations. In the first scenario, the testing and validation sets take into account only
those newly added annotations. The second scenario additionally adds to the test and validation sets the
implicit annotations entailed in the GO hierarchy by the TPR.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        KGEs are the vector embeddings learned from a set of triples describing facts in a KG. KGEs can
subsequently be used to perform reasoning tasks such as link prediction and entity classification.
Typically, KG embedding methods embed entities and relations directly into a vector space, where
each triple (head entity, relation, tail entity) in the KG is assigned a score based on its validity. The
sum of scores (i.e., the loss) over the positive and negative triple sets is optimized during training. In this paper,
we applied KGE methods to GO and its annotations to predict missing or future annotations.
To further capture and embed the TPR, we generate samples using the TPR and incorporate them into the
training data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>In detail, we formulate the task as follows. Given a KG 𝒢 consisting of triples (e₁, r, e₂), where r is a relation
between entities e₁ and e₂, the optimal vector embeddings are first learned for all entities and relations. In the
corresponding vector space, a valid triple (e₁, r, e₂) ∈ 𝒢 satisfies the relation e⃗₂ = e⃗₁ + r⃗. Given a set 𝒮 of triples
representing facts in the KG, the following criterion is then used to learn the embeddings for link prediction:
ℒ_tri = ∑_{(h,r,t)∈𝒮} ‖h⃗ + r⃗ − t⃗‖₂,
(1)</p>
      <p>
        where h⃗, r⃗, and t⃗ are the vector representations in ℝᵈ corresponding to the head entity, relation,
and tail entity. These representations are learned from the triples in the data and embedded as
d-dimensional vectors, following the procedure of TransE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our contribution is adding samples following the TPR, as shown in Figure 1, to the set 𝒮. Essentially, we
distinguish between direct gene product–function relations and higher-level ones. In the first
scenario, embeddings are learned using the TransE model on the existing triples in the dataset. In the
second scenario, we enrich the training data with additional samples from gene products inheriting
their first-level as well as second-level ancestral functions. These additional samples
serve to improve embedding quality, and we refer to this method as TransE+TPR. Details on
dataset creation and the different scenarios are given in the next section, and the code is available
on our GitHub: https://github.com/ozyygen/predict-KGE-TPR.
      </p>
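      <p>The scoring criterion of Eq. (1) can be sketched numerically. In this minimal illustration, embeddings are plain Python lists and the entity and relation names are hypothetical:</p>

```python
# Minimal numeric sketch of the criterion in Eq. (1): the score of a triple
# (h, r, t) is the L2 distance ||h + r - t||, summed over the fact set S.
# Embeddings are plain Python lists here purely for illustration.
import math

def triple_score(h_vec, r_vec, t_vec):
    """L2 distance between the translated head (h + r) and the tail t."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(h_vec, r_vec, t_vec)))

def total_loss(triples, entity_emb, relation_emb):
    """Sum of triple scores over a set of (head, relation, tail) triples."""
    return sum(triple_score(entity_emb[h], relation_emb[r], entity_emb[t])
               for h, r, t in triples)
```

<p>A valid triple whose embeddings satisfy e⃗₂ = e⃗₁ + r⃗ exactly would receive a score of zero under this criterion.</p>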
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>We generate four datasets using GOA versions from 2018 to the current one (2022). Each dataset contains
a pair of versions selected with a one-year window. We use the prior version to generate the
training set, and the latter to generate the testing and validation sets. Each set contains
triples consisting of a gene product as head, a gene function as tail, and the annotation type linking the gene
function to that gene product as relation. Additionally, we add the "is a" and "part of" semantics of
gene functions and TPR-inferred annotations from GO to the training set. Table 1 gives the triple counts
of the training, testing, and validation sets for each dataset.</p>
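      <p>The version-pair construction can be sketched as follows. This is a simplified illustration under assumed triple contents (the function name and relation label are hypothetical):</p>

```python
# Sketch of the version-pair split described above: the prior GOA release
# forms the training set, while annotations appearing only in the newer
# release form the test/validation pool. Triple contents are hypothetical.

def split_version_pair(prior, newer):
    """prior, newer: sets of (gene_product, annotation_type, go_function)
    triples. Returns (train, pool): the training set and the newly added
    annotations whose entities already occur in the prior version."""
    train = set(prior)
    known_products = {h for h, _, _ in prior}
    known_functions = {t for _, _, t in prior}
    # Keep only genuinely new annotations between entities seen in training.
    pool = {(h, r, t) for h, r, t in newer - prior
            if h in known_products and t in known_functions}
    return train, pool
```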
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <p>
        In this work, two experimental designs were studied for KGE methods across four versions and
three scenarios. Figure 1 shows the training, testing, and validation split on toy data. Nodes 8, 9, 10, and 11,
representing gene functions, annotate the gene products X1, X2, X3, and X4, respectively, in the figure.
Solid red lines denote the first-version relations between gene products and gene functions, and
dashed red lines represent the relations of the second version of the dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Scenario 1: Two consecutive versions of GOA were used to generate the dataset for training the
embedding model. Specifically, the prior version was used to generate the training set. We also
added the related GO subsumption classes and TPR-inferred annotations in the training set in order
to enrich semantic information in the KG. The test and validation sets were created with the latter
version of the pair by randomly splitting the triples (annotations) into a test set and a validation set
by a ratio of 0.5. We excluded triples containing a new gene product or gene function
not present in the prior version. This scenario is denoted by sc-1.</p>
      <p>Scenario 2: In Scenario 2, we used the same training set as Scenario 1, but extended the Scenario
1 test set with the implicit relations obtained from the TPR. We added relations that can be inferred
by the TPR to the test set. We infer these relations by applying the following rule: if a gene product
is annotated by a gene function in the training set, then the gene product must also be annotated by
the ancestral classes of that gene function. The objective of this addition is to observe whether our
method can predict the implicit links inferred by the TPR.</p>
      <p>To observe the effect of the TPR at different ancestry depths, we designed Scenario 2.1 and Scenario
2.2 with two different superclass depths:
• Scenario 2.1: In this scenario, we generated implicit annotations with the TPR using the first-level
ancestors of gene functions. This scenario is denoted by sc-2.1.
• Scenario 2.2: For this scenario, we considered second-level ancestors of gene functions in
addition to the first-level ancestors. This scenario is denoted by sc-2.2.</p>
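      <p>The depth-1 and depth-2 inference used in sc-2.1 and sc-2.2 can be sketched as follows. The graph structure and names are illustrative assumptions, not the authors' implementation:</p>

```python
# Sketch of how TPR-inferred triples for sc-2.1 and sc-2.2 could be produced:
# depth 1 annotates direct parents of each annotated GO function, and
# depth 2 also annotates grandparents. Names and structures are hypothetical.

def tpr_infer(triples, parents, depth):
    """Infer (gene_product, relation, ancestor) triples up to the given
    ancestry depth via the True Path Rule."""
    inferred = set()
    for h, r, t in triples:
        frontier = {t}
        for _ in range(depth):
            # Move one level up the GO hierarchy, recording new annotations.
            frontier = {p for n in frontier for p in parents.get(n, [])}
            inferred |= {(h, r, p) for p in frontier}
    return inferred
```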
    </sec>
    <sec id="sec-5">
      <title>5. Results and Conclusion</title>
      <p>We conducted experiments with TransE and TransE+TPR to assess the efficacy of the TPR for link
prediction accuracy. The results are shown in Table 2.</p>
      <p>We repeated the experiments with different time windows and scenarios. The four datasets help us
observe whether the triple counts of the train, test, and validation sets have an effect on the predictions. Scenario
1 does not have any inferred annotations. For Scenario 2, we added TPR-inferred GO semantics to
compare the scores with Scenario 1. The table reports both methods on the four datasets across the three scenarios.
Accordingly, the TransE Hits@10 scores show that adding hierarchical semantics to the dataset does not have a
significant impact on prediction accuracy. In contrast, for the TransE+TPR method, the
rule contributes to increasing the accuracy.</p>
      <p>Specifically, we implement TransE and TPR and evaluate them on different GOA datasets. The dataset is
enriched with TPR-inferred annotations and GO subsumption classes. The results show a significant
increase in accuracy when the rules are applied during the training process. In particular, in the best-case
scenario the proposed method achieves an average Hits@10 of 0.6275 ± 0.0814,
compared to an average Hits@10 of 0.1037 ± 0.0094 for TransE. This gain of approximately 0.52
in performance is attributed to the importance of the hierarchical information captured by the model
through the TPR samples, as explained in the previous section.</p>
      <p>Even though the TransE+TPR method achieved the highest accuracy scores for Scenario 2 on almost every
dataset, further study is required to determine the optimal depth of hierarchical GO class addition
that yields the best prediction accuracy. Also, distinguishing annotations based on evidence,
such as experimental type or automatic generation, and treating them accordingly might have
an impact on prediction accuracy. Furthermore, we expect that training a KGE with several versions
of the data will enhance its effectiveness in link prediction. Lastly, a training-test split
that takes into account gene traits such as orthology can be used to test link prediction stability. We
leave these topics open for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <article-title>Ten quick tips for using the gene ontology</article-title>
          ,
          <source>PLoS Computational Biology</source>
          <volume>9</volume>
          (
          <year>2013</year>
          )
          <fpage>e1003343</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. O.</given-names>
            <surname>Consortium</surname>
          </string-name>
          ,
          <article-title>The gene ontology resource: 20 years and still going strong</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>47</volume>
          (
          <year>2019</year>
          )
          <fpage>D330</fpage>
          -
          <lpage>D338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graph embeddings for link prediction</article-title>
          ,
          <source>Symmetry</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>485</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A literature review of gene function prediction by modeling gene ontology</article-title>
          ,
          <source>Frontiers in Genetics</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>400</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Valentini</surname>
          </string-name>
          ,
          <article-title>True path rule hierarchical ensembles</article-title>
          ,
          <source>in: International Workshop on Multiple Classifier Systems</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Smaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoehndorf</surname>
          </string-name>
          ,
          <article-title>Semantic similarity and machine learning with ontologies</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>bbaa199</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graph embedding: Approaches, applications and benchmarks</article-title>
          ,
          <source>Electronics</source>
          <volume>9</volume>
          (
          <year>2020</year>
          )
          <fpage>750</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>