=Paper=
{{Paper
|id=Vol-2456/paper64
|storemode=property
|title=Predicting Missing Links Using PyKEEN
|pdfUrl=https://ceur-ws.org/Vol-2456/paper64.pdf
|volume=Vol-2456
|authors=Mehdi Ali,Charles Tapley Hoyt,Daniel Domingo-Fernandez,Jens Lehmann
|dblpUrl=https://dblp.org/rec/conf/semweb/AliHD019
}}
==Predicting Missing Links Using PyKEEN==
<pdf width="1500px">https://ceur-ws.org/Vol-2456/paper64.pdf</pdf>
<pre>
        Predicting Missing Links Using PyKEEN

  Mehdi Ali¹,², Charles Tapley Hoyt³, Daniel Domingo-Fernández³, and Jens
                                   Lehmann¹,²

         ¹ Smart Data Analytics Group, University of Bonn, Germany
                  {mehdi.ali,jens.lehmann}@cs.uni-bonn.de
² Department of Enterprise Information Systems, Fraunhofer Institute for Intelligent
    Analysis and Information Systems, Sankt Augustin and Dresden, Germany
                       jens.lehmann@iais.fraunhofer.de
 ³ Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific
                      Computing, Sankt Augustin, Germany
        {charles.hoyt,daniel.domingo.fernandez}@scai.fraunhofer.de


        Abstract. PyKEEN is a framework that integrates several approaches
        to compute knowledge graph embeddings (KGEs). We demonstrate the
        usage of PyKEEN in a biomedical use case: we trained and evaluated
        several KGE models on a biological knowledge graph containing
        genes’ annotations to pathways and pathway hierarchies from well-known
        databases. We used the best-performing model to predict new links and
        present an evaluation in collaboration with a domain expert.

        Keywords: Machine learning · Knowledge Graphs · Bioinformatics ·
        Link Prediction


1     Introduction

Knowledge graphs (KGs) have been adopted by various research fields (e.g.,
Semantic Web, bioinformatics) to represent factual information. Examples of
KGs are DBpedia [7], Wikidata [13], and the Bio2RDF [2] repository. Although
existing KGs may contain billions of links, they are usually incomplete (i.e.,
missing links) [9]. Knowledge graph embeddings (KGEs), which learn latent
vector representations for entities and relations in KGs while best preserving
their structural characteristics, provide one avenue for predicting these missing
links.
    Because the software ecosystem for KGEs remains limited, we have developed
the KEEN Universe [1] for training, evaluating, and sharing KGEs with a strong
focus on reproducibility and transferability. It currently comprises the Python
packages: PyKEEN (Python KnowlEdge EmbeddiNgs), BioKEEN (Biological
KnowlEdge EmbeddiNgs), and the KEEN Model Zoo for sharing experimental
artifacts.
*   Copyright 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

    In this demonstration paper, we present a link prediction use case from the
biomedical domain that accompanies our resource paper at the ISWC 2019 Con-
ference [1]. In particular, we focus on the use of PyKEEN in predicting missing
links between genes and biological pathways as well as their internal hierarchies.


2   PyKEEN

PyKEEN provides the functionality to train and evaluate KGEs, as well as an
inference workflow that assists users in predicting novel links. PyKEEN
consists of two layers: the configuration layer and the learning layer.

Configuration Layer The configuration layer assists users in specifying their
KGE experiments’ datasets, models, hyper-parameters, training procedures, and
evaluation procedures. Experiments can be defined either programmatically or
through the interactive command line interface (CLI) in a terminal. The CLI
ensures that experiments are configured correctly: if a user provides invalid
input, it reports the problem and shows an example of a correct input.
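
The validation behaviour described above can be illustrated with a minimal
sketch. It is not PyKEEN's CLI code: the option names, the OPTIONS table, and
the parse_option helper are hypothetical and only show the idea of rejecting an
invalid input while pointing the user to a valid example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical options (not PyKEEN's schema): name -> (parser, valid example).
OPTIONS: Dict[str, Tuple[Callable[[str], object], str]] = {
    "embedding_dim": (int, "50"),
    "learning_rate": (float, "0.01"),
}


def parse_option(name: str, raw: str) -> object:
    """Parse a raw user input, or fail with a message that shows a valid example."""
    parser, example = OPTIONS[name]
    try:
        return parser(raw)
    except ValueError:
        raise ValueError(
            f"Invalid value {raw!r} for {name}; expected something like {example!r}"
        ) from None


embedding_dim = parse_option("embedding_dim", "50")
```

Calling parse_option("learning_rate", "fast") raises a ValueError whose message
includes the example value '0.01'.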

Learning Layer The learning layer either trains a model in training mode, based
on a user-defined set of hyper-parameters, or finds suitable hyper-parameter
values in hyper-parameter optimization (HPO) mode.
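
The text above does not say which search strategy HPO mode uses. As a hedged
illustration, the sketch below assumes plain random search over a discrete
search space. The random_search function, the search space, and the toy
objective (a stand-in for "train a model and return its mean rank") are all
assumptions made for this example and are not PyKEEN's API.

```python
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def random_search(
    space: Dict[str, List[float]],
    objective: Callable[[Params], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Sample `iterations` configurations; keep the one with the lowest objective."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)
    best_params: Params = {}
    best_value = float("inf")
    for _ in range(iterations):
        # Draw one value per hyper-parameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in space.items()}
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Hypothetical search space; names and values are illustrative only.
space = {
    "embedding_dim": [50.0, 100.0, 200.0],
    "learning_rate": [0.001, 0.01, 0.1],
}


def toy_objective(params: Params) -> float:
    """Stand-in for training a model and returning a lower-is-better metric."""
    return abs(params["embedding_dim"] - 100.0) + 100.0 * params["learning_rate"]


best_params, best_value = random_search(space, toy_objective, iterations=5)
```

In training mode, by contrast, a single user-supplied configuration would be
evaluated once instead of being sampled.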

Inference Workflow For a user-defined set of entities and relations, the
inference workflow generates all possible triples (users can choose to exclude
reflexive triples of the form (e,r,e)). It then exports a file containing the
triples and their predicted scores, sorted by score.
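
The enumeration and ranking steps can be sketched in a few lines of Python.
This is a minimal illustration, not PyKEEN's implementation: the names
candidate_triples and rank_triples and the toy scoring function are
assumptions, and the sketch assumes that higher scores mean more plausible
triples.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def candidate_triples(
    entities: Iterable[str],
    relations: Iterable[str],
    exclude_reflexive: bool = True,
) -> List[Triple]:
    """Enumerate every (head, relation, tail) combination."""
    entity_list = list(entities)
    triples = []
    for head, relation, tail in product(entity_list, list(relations), entity_list):
        if exclude_reflexive and head == tail:
            continue  # skip reflexive triples of the form (e, r, e)
        triples.append((head, relation, tail))
    return triples


def rank_triples(
    triples: Iterable[Triple],
    score: Callable[[Triple], float],
) -> List[Tuple[Triple, float]]:
    """Score each triple and sort so that the highest score comes first."""
    scored = [(triple, score(triple)) for triple in triples]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(triple: Triple) -> float:
    """Hypothetical stand-in for a trained model's scoring function."""
    head, _, tail = triple
    return float(len(head) - len(tail))


candidates = candidate_triples(["a", "bb", "ccc"], ["partOf"])
ranked = rank_triples(candidates, toy_score)
```

With n entities and m relations, this produces n * n * m candidates, or
n * (n - 1) * m when reflexive triples are excluded, so the candidate set grows
quadratically with the number of entities.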


3   Application

Biological pathway databases have been generated and used in the classical anal-
ysis of -omics data, but their formulations as KGs are not amenable to classical
machine learning approaches for classification, clustering, or predictive model-
ing. Here, we trained and evaluated three KGE models (i.e., TransE, TransR
and ComplEx) before selecting the best performer for prediction of novel roles
of genes in pathways and evaluation by a domain expert.
    Training was conducted using four datasets: three that comprise links be-
tween genes and pathways from disparate pathway databases (i.e., KEGG [6],
Reactome [5], and WikiPathways [11]) and one (i.e., ComPath [4]) that com-
prises manually curated links between pathways from the previously mentioned
resources. By predicting links in the resulting merged KG, we identified and
hypothesized the role of genes in novel pathways.

Experimental Setup For each experiment, we split the KG into training and test
sets and then performed HPO for the TransE [3], TransR [8], and ComplEx [12]
models. The results were evaluated by mean rank and hits@k and are presented
in Table 1. Afterwards, we focused on a set of entities and the relation partOf
and considered triples of the form (gene, partOf, pathway) to predict novel links
using PyKEEN’s inference workflow; these predictions were evaluated by a domain
expert.
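
For readers unfamiliar with the two metrics, the following sketch shows how
mean rank and hits@k are commonly computed from the rank that a model assigns
to each true test triple among its candidates. The function names and example
ranks are illustrative assumptions, not values or code from the experiments
reported here.

```python
from typing import Sequence


def rank_of_true(scores: Sequence[float], true_index: int) -> int:
    """1-based rank of the true candidate, assuming higher scores are better."""
    true_score = scores[true_index]
    return 1 + sum(1 for score in scores if score > true_score)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average rank of the true triples (lower is better)."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of true triples ranked within the top k (higher is better)."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Made-up ranks of four test triples, for illustration only.
example_ranks = [1, 3, 12, 250]
example_mean_rank = mean_rank(example_ranks)   # 66.5
example_hits = hits_at_k(example_ranks, k=10)  # 0.5
```

The example also shows why the two metrics can disagree: a single badly ranked
triple (here, rank 250) dominates the mean rank but changes hits@10 only
slightly.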


                      Model          Mean Rank   Hits@10
                      TransE [3]       1069.21    13.88%
                      TransR [8]        376.38    24.83%
                      ComplEx [12]      193.98    59.50%
                            Table 1. HPO results.


    While ComplEx performed well, the poor performance of TransE may be due to
its limited ability to handle the high-cardinality (N-to-M) relations in the
data. Due to time constraints, only 5 iterations of HPO were performed for
TransR, so its results might still be improved.
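
The difficulty with N-to-M relations can be made concrete with a small sketch.
TransE [3] models a relation as a translation, so a triple (h, r, t) is
plausible when h + r is close to t. If one gene is part of two different
pathways, both pathway vectors would have to equal gene + partOf, which
distinct embeddings cannot do. The two-dimensional vectors below are made up
for illustration and are not learned embeddings.

```python
import math
from typing import Sequence


def transe_distance(
    head: Sequence[float],
    relation: Sequence[float],
    tail: Sequence[float],
) -> float:
    """L2 distance ||h + r - t||; smaller means more plausible under TransE."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


gene = [0.0, 0.0]
part_of = [1.0, 0.0]
pathway_a = [1.0, 0.0]  # fits exactly: gene + part_of == pathway_a
pathway_b = [1.0, 1.0]  # a second, distinct pathway for the same gene

distance_a = transe_distance(gene, part_of, pathway_a)  # 0.0
distance_b = transe_distance(gene, part_of, pathway_b)  # 1.0
```

Both triples are equally true in this toy setting, yet only one can be fitted
exactly. Reducing distance_b would require moving pathway_b towards pathway_a,
which makes the two pathways harder to tell apart.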

Results Due to the large number of predictions made by the KGE model, we
focus on the top five predictions between genes and pathways in Table 2. By
looking at these highly plausible links, we can not only identify novel roles of
genes in pathways, but also hypothesize the role of pathways in diseases that
have been linked to a given gene.


 Gene    Database      Pathway                                             Score
 RXRA    WikiPathways  Nuclear Receptors in Lipid Metabolism and Toxicity  16.20
 UPP1    KEGG          Pyrimidine metabolism                               16.01
 EZR     KEGG          Shigellosis                                         13.63
 UGT1A1  KEGG          Porphyrin and chlorophyll metabolism                13.41
 BLM     WikiPathways  DNA IR-damage and cellular response via ATR         12.53
Table 2. Top five predicted gene-pathway links in which higher scores indicate more
plausible links.


    The two most confident predictions suggest that RXRA and UPP1 play
a role in the Nuclear Receptors in Lipid Metabolism and Toxicity and Pyrimidine
metabolism pathways, respectively. A survey of the recent biomedical literature
suggests that RXRA is involved in lipid metabolism and UPP1 in the degradation
and salvage of pyrimidine ribonucleosides. Interestingly, the third predicted
link, which suggests the involvement of EZR, a gene encoding a cytoplasmic
peripheral membrane protein, in the disease Shigellosis, has been previously
described by Skoudy et al. [10], who implicated the gene in the process of
Shigella bacterial uptake. Ultimately, it could
be interesting to investigate other possible links connecting genes to pathways
implicated in diseases.

4    Conclusion
We have demonstrated the usage of PyKEEN in predicting missing links in KGs
from the biomedical domain. In particular, we performed HPO for three KGE
models (i.e., TransE, TransR, and ComplEx) and selected the best-performing
model to provide predictions to a domain expert, who manually evaluated the
top-ranked predictions. Finally, the fact that this workflow can be applied in
any domain highlights the effectiveness of PyKEEN for discovering novel
knowledge.

Acknowledgement This work was supported by the German nationally funded
BMBF project MLwin.

References
 1. Ali, M., et al.: The KEEN Universe: An ecosystem for knowledge graph embeddings
    with a focus on reproducibility and transferability. In: International Semantic Web
    Conference (2019)
 2. Belleau, F., et al.: Bio2RDF: towards a mashup to build bioinformatics knowledge
    systems. Journal of Biomedical Informatics 41(5), 706–716 (2008)
 3. Bordes, A., et al.: Translating embeddings for modeling multi-relational data. In:
    Advances in Neural Information Processing Systems. pp. 2787–2795 (2013)
 4. Domingo-Fernandez, D., et al.: ComPath: an ecosystem for exploring, analyzing,
    and curating mappings across pathway databases. npj Systems Biology and
    Applications 4(44) (2018). https://doi.org/10.1038/s41540-018-0078-8
 5. Fabregat, A., et al.: The Reactome Pathway Knowledgebase. Nucleic Acids
    Research 46(D1), D649–D655 (2018). https://doi.org/10.1093/nar/gkx1132
 6. Kanehisa, M., et al.: KEGG: new perspectives on genomes, pathways, diseases and
    drugs. Nucleic Acids Research 45(D1), D353–D361 (2017)
 7. Lehmann, J., et al.: DBpedia - a large-scale, multilingual knowledge base extracted
    from Wikipedia. Semantic Web Journal 6(2), 167–195 (2015)
 8. Lin, Y., et al.: Learning entity and relation embeddings for knowledge graph
    completion. In: Twenty-ninth AAAI Conference on Artificial Intelligence (2015)
 9. Nickel, M., et al.: A review of relational machine learning for knowledge
    graphs. Proceedings of the IEEE 104(1), 11–33 (2015)
10. Skoudy, A., et al.: A functional role for ezrin during Shigella flexneri
    entry into epithelial cells. Journal of Cell Science 112(13), 2059–2068 (1999),
    https://jcs.biologists.org/content/112/13/2059
11. Slenter, D., et al.: WikiPathways: a multifaceted pathway database bridging
    metabolomics to other omics research. Nucleic Acids Research 46(D1), D661–D667
    (2018). https://doi.org/10.1093/nar/gkx1064
12. Trouillon, T., et al.: Complex embeddings for simple link prediction. In:
    International Conference on Machine Learning. pp. 2071–2080 (2016)
13. Vrandečić, D., et al.: Wikidata: a free collaborative knowledgebase.
    Communications of the ACM 57(10), 78–85 (2014)

</pre>