Knowledge Graph Embedding for Ecotoxicological Effect Prediction*

Erik B. Myklebust¹,², Ernesto Jimenez-Ruiz²,³, Jiaoyan Chen⁴, Raoul Wolf¹, and Knut Erik Tollefsen¹

¹ Norwegian Institute for Water Research (NIVA), Oslo, Norway
² Department of Informatics, University of Oslo, Oslo, Norway
³ Alan Turing Institute, London, United Kingdom
⁴ Department of Computer Science, University of Oxford, United Kingdom

Abstract. Determining the effects of a chemical compound on a species takes considerable experimental effort. Appropriate methods for estimating and suggesting new effects can dramatically reduce the work a laboratory needs to do. Here, we explore the suitability of a knowledge graph embedding approach for ecotoxicological effect prediction. A knowledge graph has been constructed from publicly available data sets, including a species taxonomy and chemical knowledge. These knowledge sources are integrated using ontology alignment techniques. Our experimental results show that the knowledge graph and its embeddings augment the baseline models.¹

1 Introduction

It takes immense experimental effort to determine the ecotoxicological effects that a chemical compound has on a species. Such effect data are available only for a narrow range of compound-species pairs and a limited number of experimental tests. Here, we present a preliminary study of the benefits of using Semantic Web tools to integrate different data sources, and of using knowledge graph (KG) approaches to improve ecotoxicological effect prediction over a baseline. Our contribution is twofold:
(i) We have created a KG by gathering and integrating the relevant data from disparate sources. To discover equivalent entities we exploit internal resources, external resources (e.g., Wikidata [16]) and ontology alignment (e.g., LogMap [6, 5]).
(ii) We have evaluated three KG embedding approaches (TransE [2], DistMult [18] and HolE [12]) together with a baseline based on a one-hot encoding.
Our evaluation shows that KG embedding improves the metrics for a majority of the selected classification models. Note that recall is preferred over precision, i.e., we would rather overestimate the effect of a chemical compound than underestimate its hazardousness.

* Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ This paper is a short version of our ISWC 2019 In-use paper [11].

2 Preliminaries

Knowledge graphs. We follow the RDF-based notion of KGs [1], which are composed of RDF triples ⟨s, p, o⟩, where s represents a subject (a class or an instance), p represents a predicate (a property) and o represents an object (a class, an instance or a data value, e.g., text, date or number).

Ontology alignment. Ontology alignment is the process of finding mappings or correspondences between a source and a target ontology or knowledge graph [3]. These mappings are typically represented as equivalences among the entities of the input resources (e.g., ncbi:DaphniaMagna owl:sameAs ecotox:daphniamagna).

Embedding models. KG embedding [17] plays a key role in link prediction problems, where the goal is to learn a scoring function S : E × R × E → ℝ, with E the set of entities and R the set of relations. S(s, p, o) is proportional to the probability that a triple ⟨s, p, o⟩ is encoded as true. Several models have been proposed, e.g., the translating embeddings model (TransE) [2]. These models are applied to resolve missing facts in largely connected KGs, such as DBpedia [9].

3 The TERA knowledge graph

We construct the Toxicology and Risk Assessment (TERA) KG from four sources:
(i) The effect data are gathered from ECOTOX [15]. We focus on acute effects, e.g., LC50 (lethal concentration for 50% of test species) and NR-ZERO (no effect on all test species). These data are converted to compound-species pairs, each with a label (true or false).
(ii) The chemical hierarchy is created by combining RDF data available from PubChem [8] with data obtained by querying the ChEMBL [4] SPARQL endpoint.
(iii) The species hierarchy is gathered from the tabular data available in the NCBI Taxonomy [14].
(iv) We gather species habitat and endemic data from the Encyclopedia of Life (EOL) [13].

We align the four data sources using LogMap and the Wikidata SPARQL endpoint. Details of the construction of the TERA knowledge graph are available in [10].

4 Effect prediction

We learn different types of classification models, including Gaussian naive Bayes (NB), quadratic discriminant analysis (QDA), radial basis function kernel support-vector machine (SVM), and multilayer perceptron (MLP), to solve the problem described in Figure 1. The input is a compound-species pair. It is encoded either as the concatenation of the one-hot vectors of the compound and the species (baseline), or as the concatenation of the embedding vectors learned by the embedding model (TransE [2], DistMult [18] or HolE [12]). These models were considered since they are intuitive, have shown state-of-the-art performance (e.g., [7]), and encode directional relationships, respectively. The output is binary: Affects (1) or Not affects (0), representing whether or not the compound affects the species.

Fig. 1: The effect prediction problem. Lowercase s_j and c_i are instances of species and compounds, while uppercase letters denote classes in the hierarchy. Solid lines are observations and dashed lines are to be predicted, i.e., does c2 affect s1?

Fig. 2: Prediction results for the different models: (a) Accuracy, (b) Precision, (c) Recall, (d) AUC.

5 Results and Discussion

Results.
Figure 2 shows the results of the different models using different encodings of the input (compound-species pair). We find that two of the four classification models, namely SVM and MLP, achieve higher performance with KG embedding than with one-hot encoding. For the QDA model, KG embedding also yields higher recall than one-hot encoding, although the overall metrics, AUC and accuracy, are similar. Note that recall is more important than precision in ecotoxicological effect prediction. The only exception is the NB model, where one-hot encoding performs much better than KG embedding. This is because NB assumes that the input variables are conditionally independent; hence, it works better on the one-hot encoding, which is quite sparse. However, it is worth noting that NB with one-hot encoding does not outperform the MLP and QDA models with KG embedding.

Conclusion. We have created a KG called TERA that aims to cover the knowledge and data relevant to the ecotoxicological domain. We have also implemented a proof-of-concept prototype for ecotoxicological effect prediction based on knowledge graph embeddings and classification models. Some of the models used can take advantage of the learned embedded features. However, simple models like NB performed better with the one-hot encoded vectors. The obtained results are encouraging, showing the positive impact of using KG embedding models and the benefits of having an integrated view of the different knowledge and data sources.

Future work. The main long-term goal is to make the TERA KG accessible to domain researchers and to improve the effect prediction by enriching the KG. In the near future, we intend to improve the current ecotoxicological effect prediction prototype and evaluate the suitability of more sophisticated models like Graph Convolutional Networks.

Resources.
The datasets, evaluation results, documentation and source code are available from the following GitHub repository: https://github.com/Erik-BM/NIVAUC

Acknowledgements. This work is supported by grant 272414 from the Research Council of Norway (RCN), the MixRisk project (RCN 268294), the AIDA project (The Turing Institute) and the SIRIUS Centre for Scalable Data Access (RCN 237889).

References

1. Arnaout, H., Elbassuoni, S.: Effective Searching of RDF Knowledge Graphs. Web Semantics: Science, Services and Agents on the World Wide Web 48(0) (2018)
2. Bordes, A., et al.: Translating Embeddings for Modeling Multi-relational Data. In: Advances in Neural Information Processing Systems 26, pp. 2787–2795 (2013)
3. Euzenat, J., Shvaiko, P.: Ontology Matching, Second Edition. Springer (2013)
4. Hastings, J., et al.: ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research 44(D1), D1214–D1219 (January 2016)
5. Jiménez-Ruiz, E., Cuenca Grau, B.: LogMap: Logic-Based and Scalable Ontology Matching. In: 10th International Semantic Web Conference. pp. 273–288 (2011)
6. Jiménez-Ruiz, E., Cuenca Grau, B., Zhou, Y., Horrocks, I.: Large-scale interactive ontology matching: Algorithms and implementation. In: ECAI Conference. pp. 444–449 (2012)
7. Kadlec, R., Bajgar, O., Kleindienst, J.: Knowledge base completion: Baselines strike back. CoRR abs/1705.10744 (2017), http://arxiv.org/abs/1705.10744
8. Kim, S., et al.: PubChem 2019 update: improved access to chemical data. Nucleic Acids Research 47(D1), D1102–D1109 (October 2018)
9. Lehmann, J., et al.: DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015)
10. Myklebust, E.B., Jiménez-Ruiz, E., Chen, J., Wolf, R., Tollefsen, K.E.: Enabling Semantic Data Access for Toxicological Risk Assessment. CoRR abs/1908.10128 (2019)
11.
Myklebust, E.B., Jiménez-Ruiz, E., Chen, J., Wolf, R., Tollefsen, K.E.: Knowledge graph embedding for ecotoxicological effect prediction. In: Int’l Sem. Web Conf. (ISWC) (2019)
12. Nickel, M., Rosasco, L., Poggio, T.A.: Holographic embeddings of knowledge graphs. CoRR abs/1510.04935 (2015), http://arxiv.org/abs/1510.04935
13. Parr, C.S., et al.: The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth (2014)
14. Sayers, E.W., et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 37(suppl 1), D5–D15 (October 2008)
15. U.S. EPA: Ecotoxicology knowledgebase (ECOTOX) (2019), https://cfpub.epa.gov/ecotox/
16. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)
17. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743 (2017)
18. Yang, B., Yih, W.-t., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575 (2015)
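
Illustrative sketch. The two input encodings of Section 4 and the scoring functions of the three embedding models can be made concrete in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the code used in the evaluation: the scoring functions follow the standard formulations of TransE [2], DistMult [18] and HolE [12], the L2 norm for TransE is an assumption, and all function names, index sizes and vector values are hypothetical and chosen only for illustration.

```python
import math


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_pair(compound_vec, species_vec):
    """Concatenate a compound vector and a species vector into one classifier input."""
    return list(compound_vec) + list(species_vec)


def transe_score(s, p, o):
    """TransE: negative L2 distance between s + p and o (higher means more plausible)."""
    return -math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))


def distmult_score(s, p, o):
    """DistMult: trilinear product sum_i s_i * p_i * o_i (symmetric in s and o)."""
    return sum(si * pi * oi for si, pi, oi in zip(s, p, o))


def hole_score(s, p, o):
    """HolE: dot product of p with the circular correlation of s and o."""
    d = len(s)
    # Circular correlation: corr[k] = sum_i s[i] * o[(i + k) mod d]
    corr = [sum(s[i] * o[(i + k) % d] for i in range(d)) for k in range(d)]
    return sum(pk * ck for pk, ck in zip(p, corr))


# Hypothetical toy setting: three compounds (c1..c3) and three species (s1..s3).
compound_index = {"c1": 0, "c2": 1, "c3": 2}
species_index = {"s1": 0, "s2": 1, "s3": 2}

# Baseline input for the pair (c2, s1): concatenated one-hot vectors.
baseline_input = encode_pair(
    one_hot(compound_index["c2"], len(compound_index)),
    one_hot(species_index["s1"], len(species_index)),
)

# Embedding input for the same pair: concatenated (made-up) 2-d embedding vectors.
compound_embedding = {"c2": [0.5, -1.0]}
species_embedding = {"s1": [1.0, 0.0]}
embedded_input = encode_pair(compound_embedding["c2"], species_embedding["s1"])
```

Either `baseline_input` or `embedded_input` would then be passed to a classifier (NB, QDA, SVM or MLP) that outputs Affects (1) or Not affects (0). Swapping subject and object leaves the DistMult score unchanged, whereas the HolE score generally changes, which is the sense in which HolE encodes directional relationships.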