A Knowledge Graph for Ecotoxicological Risk Assessment and Effect Prediction? Erik B. Myklebust 1 Norwegian Institute for Water Research, Oslo, Norway 2 Department of Informatics, University of Oslo, Oslo, Norway erik.b.myklebust@niva.no Abstract. Exploring the effects a chemical compound has on a species takes a considerable experimental effort. Appropriate methods for es- timating and suggesting new effects can dramatically reduce the work needed to be done by a laboratory. In this PhD research we aim at ex- ploring the suitability of using a knowledge graph embedding approach for ecotoxicological effect prediction. A knowledge graph is being con- structed from publicly available data sets, including a species taxonomy and chemical classification and similarity. We use ontology alignment techniques to integrate the effect data into the knowledge graph. Our preliminary experimental results show that the knowledge graph based approach improves the selected baselines. Keywords: Knowledge graph · Semantic embedding · Ecotoxicology 1 Problem statement Ecotoxicological risk assessment is the task of estimating the risk to a ecosystem by foreign chemicals. The diverse datasets used in risk assessment needs to be aggregated into a common vocabulary before being used in the risk prediction process. This aggregation requires the use of (semi-)manually curated mappings. Creating these mappings is a tremendous task for the domain experts that would benefit from suitable tool support. At the heart of the data is the effects. This data describes the effects com- pounds has on species. The majority of effect data relates a compound-species pair to a mortality or chronic (e.g., reproductive) effect. Due to the large search space of compound-species pairs, less than 1% of possible combinations has been studied. As a result of the large cost and effort to conduct these experiments, this proportion will not suddenly increase. Hence, we have two main research tasks, where the latter is reliant on the first. These can be summarized as follows: (i) Create a knowledge graph by gathering and integrating the relevant biolog- ical effect data and knowledge, such that to relieve the (domain) researchers of the manual work. ? Copyright ©2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 2 E. B. Myklebust Fig. 1: Risk assessment pipeline. (ii) Using the knowledge graph together with machine learning techniques to predict effects. The objectives of this task are twofold: (a) Limit the search space for the laboratory (binary prediction). (b) Predict effects outright with a margin of error (regression). 2 Background and related work In this section we introduce some preliminaries and give insights into the current state of the art efforts applying semantic web technologies within the field of toxicology and risk assessment. Use case. Ecotoxicology is a multidisciplinary field that studies the ecologi- cal and toxicological effects of chemical pollutants on populations, communities and ecosystems. Risk assessment is the result of the intrinsic hazards of a sub- stance combined with an estimate of the environmental exposure (i.e., Hazard + Exposure = Risk). Figure 1 shows a risk assessment pipeline. Exposure is data gathered from the environment, while effects are hypothesis that are tested in a laboratory. These two data sources are used to calculate risk, which is used to find (further) susceptible species and the mode of action (MoA) or type of impact a compound would have over those species. Results from the MoA analysis are used as new effect hypothesis. Effect prediction. Estimating the effect a compound has on a species is a large research field within ecotoxicology. Currently, state-of-the-art solutions such as Quantitative Structure-Activity Relationship (QSAR) models ( e.g., [7, 13, 14]) exists. However, these are limited in scope. Each QSAR consider small groups of compounds and a single or a few species. Therefore, a general approach suited for a larger subset of the domain is favourable. A KG for Ecotoxicological Risk Assessment and Effect Prediction 3 Knowledge graphs. We follow the RDF-based notion of knowledge graphs [4] which are composed by RDF triples hs, p, oi, where s represents a subject (a class or an instance), p represents a predicate (a property) and o represents an object (a class, an instance or a data value e.g., text, date and number). RDF entities (i.e., classes, properties and instances) are represented by an URI (Uniform Re- source Identifier). A knowledge graph can be split into a TBox (terminology), often composed by RDF Schema constructors like class subsumption and prop- erty domain and range,3 and an ABox (assertions), which contain relationships among instances and semantic type definitions. RDF-based Knowledge Graphs can be accessed with SPARQL queries, the standard language to query RDF graphs. There is emerging work in improving the usability of ecotoxicological data by mapping to knowledge graphs or ontologies, e.g., [10], however, currently this work is limited. We are unaware of work incorporating the vast array of sources that is required from beginning to end by a risk assessment system. Ontology alignment. Ontology alignment is the process of finding mappings or correspondences between a source and a target ontology or knowledge graph [9]. These mappings are typically represented as equivalences among the entities of the input resources. Currently, mapping ecotoxicological data to different sources are under con- struction. The ECOTOX web search interface4 now contains mappings to a external taxonomy source [20] (for a limited number of taxons). Fay et al. [10] indicates a full mapping to external sources exists, however, this is not yet pub- licly available. We are not aware of efforts toward mapping taxonomic classes, e.g., genus, family, etc. which can reveal inconsistencies in the datasets. Embedding models. Knowledge graph embedding [24] plays a key role in link prediction problems where the goal is to learn a scoring function S : E ×R×E → R. S(s, p, o) is proportional to the probability that a triple hs, p, oi is encoded as true. Several models has been proposed, e.g., Translating embeddings model (TransE) [5]. These models are applied to knowledge graphs to resolve missing facts in largely connected knowledge graphs, such as DBPedia [17]. There is previous work investigating modelling of chemical effects, e.g., [16, 12]. The prediction of ecotoxicological effects can be seen as a sub-problem. These works investigate models that use the chemical structures to determine their effect on species. Yet, we are not aware of approaches where multiple knowledge graph embeddings are used to model the interaction between knowledge graphs. 3 The OWL 2 ontology language provides more expressive constructors. Note that the graph projection of an OWL 2 ontology can be seen as a knowledge graph (e.g., [1]). 4 https://cfpub.epa.gov/ecotox/ 4 E. B. Myklebust 3 Relevance The relevance of the research to be conducted during the PhD can be summarized as follows: (i) Manually integrating background knowledge into risk assessment systems is cumbersome since a common vocabulary does not exists. Our approach will reduce the time spent organizing data, and increase the number of case studies than can be conducted. A common vocabulary will enhance the interoperability between several risk assessment systems, increasing the confidence in the assessments. (ii) The effect data used in risk assessment models is the result of time-consuming laboratory work. By using machine learning techniques with background knowledge, in the form of a knowledge graph, we aim at being able to limit the search space for new tests to be analysed in the laboratory. For exam- ple, we aim at recommending the top-ten compounds to test on a specific species, rather than conducting experiments using thousands of possible compounds. (iii) Design and implementation of a fully-fledged recommender system to pre- dict the level of effect on a species. For example, DEET (pesticide) has the potential to kill 50% of the population of the common house fly. Such generalization using the available data and knowledge is the main target of the research, which aims at reducing to a minimum further laboratory analysis. 4 Research questions and hypothesis This work aims to address the following questions: a. Can the disparate data sources used in ecotoxicological risk assessment be integrated into a knowledge graph to improve accessibility? b. Can the knowledge graph be used to improve (or diversify) ecotoxicological effect prediction over current state-of-the-art models? The hypothesis associated with the above questions are: A. It is possible to integrate disparate data sources in a toxicological knowledge graph using Semantic Web tools. B. Extrapolation of effect data increase the reach of risk assessment systems while remaining accurate. 5 Approaches This section will describe the approaches used to investigate the hypothesis above. The evaluation of the hypothesis is described in Section 7. Hypothesis A. There are multiple sources, varying from tabular, SPARQL endpoints, REST APIs, and RDF formats, each with its own vocabulary, that needs to be integrated to enable a unified data access. The main sources of data are: A KG for Ecotoxicological Risk Assessment and Effect Prediction 5 ECOT OX Split Species Ef f ects Compounds T ransf orm M ap Alignment TERA-KG Alignment T ransf orm SP ARQL Import N CBI ChEBI P ubChem Fig. 2: Data sources in the TERA knowledge graph. Compound classification is available from PubChem. Chemical class hierarchy from the ChEMBL SPARQL endpoint. Compound literals are gathered from PubChem REST API and trans- formed into triples. ECOTOX and PubChem identifiers are aligned using the Wikidata SPARQL endpoint. ECOTOX and NCBI taxonomies are aligned us- ing ontology alignment. (i) Effect data (ECOTOX [23], example seen in Table 1) in tabular format. Includes limited metadata linked to proprietary identifiers for compounds and species. (ii) Compound data from different sources. Hierarchies available through down- loadable RDF files and SPARQL endpoints (PubChem [22] and ChEMBL [6]). Compound features, e.g., Molecular weight, XLogP etc. are available through the PubChem REST API. (iii) The tabular NCBI taxonomy [20] is used as the hierarchy for species. We must map the identifiers used in the effect data to open standards to take advantage of the diversity of data sources. The created Toxicological Effects and Risk Assessment (TERA) knowledge graph with current sources and aggregation steps is shown in Figure 2. Excerpts of triples from TERA are shown in Table 2. test id reference number test cas species number 1068553 5390 877430 (2,6-Dimethylquinoline) 5156 (Danio rerio) 2037887 848 79061 (2-Propenamide) 14 (Rasbora heteromorpha) result id test id endpoint conc1 mean conc1 unit 98004 1068553 LC50 400 mg/kg diet 2063723 2037887 LC10 220 mg/L Table 1: ECOTOX database entry examples. 6 E. B. Myklebust # subject predicate object (i) ecotox:group/Worms owl:disjointWith ecotox:group/Fish (ii) ncbi:division/2 owl:disjointWith ncbi:division/4 (iii) ecotox:taxon/34010 rdfs:subClassOf ecotox:taxon/hirta (iv) ncbi:taxon/687295 rdfs:subClassOf ncbi:taxon/513583 (v) compound:CID10198308 rdf:type obo:CHEBI 134899 (vi) compound:CID10198308 pubchem:formula ‘‘C7 H6 O6 S’’ (vii) ecotox:chemical/115866 ecotox:affects ecotox:effect/001 (viii) ecotox:effect/001 ecotox:species ecotox:taxon/26812 (ix) ecotox:effect/001 ecotox:endpoint LC50 (x) ecotox:taxon/33155 owl:sameAs ncbi:taxon/311871 Table 2: Example triples from the TERA knowledge graph Improving the knowledge graph can be done with several sources. First, a dataset containing biological activity, e.g., Chemical ontology (CO) [11]. Such datasets would enable finer grained data to be used by the effect predictor. We also aim at including an anatomy dataset, such that the biological activity can be aggregated from proteins to individual level. Another aspect important to effect prediction is the habitat of the species, e.g., [8]. Including the species habitat data will limit the effect prediction search space further, e.g., heavy insoluble compounds (sinks in water) would have lit- tle/no effect on fish. Hypothesis B. The prediction task at hand is depicted in Figure 3. Initially, we use a naive approach, which is to assume that similar compounds has a comparable effect on the same species and vice versa. The state of the art in risk assessment systems implement akin solutions. However, it is not clear what constitutes similarity in this context. The similarity between compounds are quantifiable using different methods, however, similarity does not imply sim- ilar biological activity [18]. For species, the naive solution is to calculate the taxonomic distance, but again the classifications of species is not defined by the susceptibility to compounds. Consequently, additional sources that describe these phenomena need to be added to the knowledge graph. When the knowl- edge graph is enriched with this data we can explore modelling techniques for embedding the knowledge graph for the purpose of predicting effects. We aim at applying simple embedding methods, TransE [5], DistMult [25], and HolE [21], until their performance is exhausted. These model may preform adequately for producing recommendation to the lab, however, as shown in the next section these models cannot be fully trusted to predicting effects outright. Therefore, we intend to include more expressive models, such as Graph Convolution Networks (GCN) [15]. Current approaches in knowledge graph embedding do not consider sparsely connected knowledge graphs, such as the hierarchical structures that make up TERA. Therefore, we aim at using the classification power of GCNs to embed groups of species or compounds more accurately. This will include the use of the vast array of chemical properties (experimental or computed) and the protein classification available for most species. A KG for Ecotoxicological Risk Assessment and Effect Prediction 7 Af f ects c1 s1 type type CA N ot af f ects SA subClassOf type type subClassOf CR c2 Af f ects s2 SR type subClassOf CB Af f ects SB type type c3 s3 Fig. 3: The effect prediction problem. Lowercase sj and ci are instances of species and compounds, while uppercase denote classes in the hierarchy. Solid lines are observations and dashed lines are to be predicted. i.e., does c2 affect s1 ? 30 NN MLP MLP + Embedding Accuracy 0.53 0.81 0.82 0.82 0.81 Precision 0.45 0.78 0.78 0.78 0.75 Recall 0.80 0.74 0.79 0.79 0.80 F1 score 0.58 0.76 0.78 0.78 0.77 Fβ=2 score 0.69 0.75 0.78 0.78 0.79 AUC − 0.89 0.89 0.90 0.89 Table 3: Performance of the prediction models. All values are averages over 10 clean test runs. NN is the graph nearest-neighbour approach (using the closest 30 neighbours). MLP is the multi-layer perceptron model. The three values under MLP + Embedding are the results using TransE, DistMult, HolE embedding methods, respectively. 6 Preliminary results We have evaluated three (plus variants) prediction models based on the effect data and the TERA knowledge graph. Note that currently the TERA knowledge graph has been created with the bare minimum of sources required to integrate the effect data with external metadata for compounds and species. Selected re- sults are shown in Table 3. Prediction models used in this preliminary evaluation: (i) A nearest-neighbour approach. A compound-species pair can inherent an effect if another compound or species is close in the knowledge graph. This method provides a useful baseline. However, the performance of this method is far from ideal, as it either will over or underestimate effects based on the number of neighbours considered. (ii) A zero-background-knowledge multi-layer perceptron (MLP) model was applied to the effects data. This model is able to learn simple relations, e.g., 8 E. B. Myklebust (a) Accuracy for the MLP prediction models. (b) Recall for the MLP prediction models. Fig. 4: Accuracy and Recall for the MLP models with various thresholds. s1 and s2 is effected by c3 , therefore, c3 is toxic and will effect s3 . However, when this model is presented with previously unseen compound-species pairs, it cannot rely on background knowledge, and hence, the prediction will be highly flawed. (iii) Using knowledge graph embedding ([5, 25, 21]) on TERA, followed by the same MLP model architecture as above yields better results for recall (which is preferred), while accuracy remains similar. In contrast to the above model, this model is more uncertain when unseen combinations are presented to the model (in dubio pro reo). As shown in Figures 4a and 4b, lowering the decision threshold (from 0.5 to 0.35) would yield a higher recall (0.93) for the HolE-based model, while reducing the accuracy (0.75). A KG for Ecotoxicological Risk Assessment and Effect Prediction 9 The obtained predictions are promising and show the potential usefulness of the machine learning models in our setting and the benefits of using the TERA knowledge graph. As mentioned before, we favour recall with respect to precision. One the one hand, false positives are not necessarily harmful, while overlooking the hazard of a chemical may have important consequences. On the other hand, due to the limited experiments in terms of concentration (i.e., effect data may not be complete), some chemicals may look less toxic than others while they may still be hazardous. At the same time the adoption of a RDF-based knowledge graph enables the use of an extensive range of Semantic Web infrastructure that is currently available (e.g., reasoning engines, ontology alignment systems, SPARQL query engines). 7 Evaluation plan In this section, we introduce the evaluation plan for the success of this project. We can divide the evaluation of both research questions into qualitative and quantitative measures. The value of the knowledge graph in toxicology research is uncertain at this stage. The knowledge graph must provide value for the researchers. We can ensure this by evaluating the quality of the knowledge graph. Our definition of quality is that the knowledge graph should have high levels of: (i) Coverage. The sources included in the knowledge graph must cover the areas of interest. The coverage also relates to the degree of successful map- pings between the sources. There will be a trade-off between completeness and correctness of the mappings. (ii) Integration. The ease of integrating the various sources. This involves align- ing and mapping to attain a consolidated knowledge graph. (iii) Functionality. The ability of the knowledge graph to be integrated into the risk assessment systems. This includes keeping the flexibility of Semantic Web technology without a commitment to a schema. We can add new triples and extend the knowledge graph, without the need of major changes. (iv) Embedding enrichment. The semantic enrichment the knowledge graph gives the embeddings compared to embeddings learned from effects. The quantitative evaluation of the knowledge graph generation is tightly related to that of evaluating the effect prediction. We can evaluate the ability to make good effect predictions quite easily. This can be done with precision, recall, accuracy, etc. for binary effects or with mean squared error, R2 -score, etc. for regression. However, the value of these predictions is integrating them into the risk assessment pipeline. The evaluation metrics must be inline with the ability our methods have to enhance risk assessment. We will compare environmental case studies results before and after the use of our modelling results. Since there is no ground truth data for risk assessment we rely on domain experts to determine if our contributions adds value to the assessments. Risk assessments has currently large margins of errors (experimental errors etc.), and we may introduce new sources of error with our effect predictions. 10 E. B. Myklebust However, we are confident that errors can also be reduced by greater data cov- erage. These are different types of errors and part of the evaluation process will be to find the optimal trade-off between them. The current preliminary results uses random dataset splits for training and testing the models. We aim at introducing highly selective datasets that can test predictive performance in different scenarios. We will also try a completely clean test, where we recommend compound-species pairs to be tested in the lab. This will obviously be limited by the available compounds and test species of the particular laboratory. The methodologies and knowledge graphs will be publicly available such that feedback from the community can help us evaluate and improve our contribu- tions. 8 Reflections The conducted work falls into one of the main research lines of toxicology re- search to enhance the generation of hypothesis to be tested in the laboratory [19]. Furthermore, the data integration efforts and the construction of the TERA knowledge graph is a large contribution to the area of risk assessment. The availability and accessibility of the best knowledge and data will enable optimal decision making. Knowledge graph embedding models have been applied in general purpose link discovery and knowledge graph completion tasks [24]. They have also at- tracted the attention in the biomedical domain to find, for example, candidate genes for a disease, protein-protein interactions or drug-target interactions (e.g., [3, 2]). However, we are not aware of the application of knowledge graph embed- ding models in the context of toxicological effect prediction. Acknowledgements This PhD project is supported by grant 272414 from the Research Council of Norway. I would like to thank my supervisors, Ernesto Jimenez-Ruiz (The Alan Turing Institute and University of Oslo), Raoul Wolf (Norwegian Institute for Water Research), and Knut Erik Tollefsen (Norwegian Institute for Water Re- search) for their feedback on this work. In addition, I would also like to thank Jiaoyan Chen (University of Oxford), Martin Giese (University of Oslo) and Zofia C. Rudjord (Norwegian Institute for Water Research) for their contribution in different stages of this PhD research. References 1. Agibetov, A., Jiménez-Ruiz, E., Ondresik, M., Solimando, A., Banerjee, I., Guer- rini, G., Catalano, C.E., Oliveira, J.M., Patanè, G., Reis, R.L., Spagnuolo, M.: Supporting shared hypothesis testing in the biomedical domain. J. Biomedical Se- mantics 9(1), 9:1–9:22 (2018) A KG for Ecotoxicological Risk Assessment and Effect Prediction 11 2. Agibetov, A., Samwald, M.: Global and local evaluation of link prediction tasks with neural embeddings. In: 4th Workshop on Semantic Deep Learning (ISWC workshop). pp. 89–102 (2018) 3. Alshahrani, M., Khan, M.A., Maddouri, O., Kinjo, A.R., Queralt-Rosinach, N., Hoehndorf, R.: Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33(17), 2723–2730 (2017) 4. Arnaout, H., Elbassuoni, S.: Effective Searching of RDF Knowledge Graphs. Web Semantics: Science, Services and Agents on the World Wide Web 48(0) (2018) 5. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems 26, pp. 2787–2795. Curran Associates, Inc. (2013) 6. ChEBI-ontology: The european bioinformatics institute (2019), https://www.ebi.ac.uk/chebi/ 7. Escher, B.I., Baumer, A., Bittermann, K., Henneberger, L., Knig, M., Khnert, C., Klver, N.: General baseline toxicity qsar for nonpolar, polar and ionisable chemicals and their mixtures in the bioluminescence inhibition assay with aliivibrio fischeri. Environ. Sci.: Processes Impacts 19, 414–428 (2017) 8. European Environment Agency: Linkages of species and habitat types to maes ecosystems (2015), https://www.eea.europa.eu/data-and-maps/data/linkages-of- species-and-habitat 9. Euzenat, J., Shvaiko, P.: Ontology Matching, Second Edition. Springer (2013) 10. Fay, K., Elonen, C., Hoff, D., Skopinski, M., Pilli, A., Wang, R., LaLone, C.: En- hancing the Utility of the ECOTOX knowledgebase (ECOTOX KB) via ontology- based semantics mapping. In: SETAC Europe, Rome, ITALY, May 14 - 18, 2018. (2018) 11. Feldman, H.J., Dumontier, M., Ling, S., Haider, N., Hogue, C.W.: Co: A chemical ontology for identification of functional groups and semantic comparison of small molecules. FEBS Letters 579(21), 4685 – 4691 (2005) 12. Forbes, V.E., Calow, P., Sibly, R.M.: Are current species extrapolation models a good basis for ecological risk assessment? Environmental Toxicology and Chemistry 20(2), 442–447 (2001) 13. Khan, K., Benfenati, E., Roy, K.: Consensus qsar modeling of toxicity of pharma- ceuticals to different aquatic organisms: Ranking and prioritization of the drug- bank database compounds. Ecotoxicology and Environmental Safety 168, 287 – 297 (2019) 14. Khan, K., Khan, P.M., Lavado, G., Valsecchi, C., Pasqualini, J., Baderna, D., Marzo, M., Lombardo, A., Roy, K., Benfenati, E.: Qsar modeling of daphnia magna and fish toxicities of biocides using 2d descriptors. Chemosphere 229, 8 – 17 (2019) 15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907 (2016), http://arxiv.org/abs/1609.02907 16. Laender, F.D., Morselli, M., Baveco, H., den Brink, P.V., Guardo, A.D.: Theoret- ically exploring direct and indirect chemical effects across ecological and exposure scenarios using mechanistic fate and effects modelling. Environment International 74, 181 – 190 (2015) 17. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - A large- scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015) 18. Martin, Y.C., Kofron, J.L., Traphagen, L.M.: Do structurally similar molecules have similar biological activity? Journal of Medicinal Chemistry 45(19), 4350–4358 (Sep 2002) 12 E. B. Myklebust 19. Myklebust, E.B., Jimenez-Ruiz, E., Rudjord, Z.C., Wolf, R., Tollefsen, K.E.: In- tegrating semantic technologies in environmental risk assessment: A vision. In: 29th Annual Meeting of the Society of Environmental Toxicology and Chemistry (SETAC) (2019) 20. NCBI-Taxonomy: The national center for biotechnology information (2019), https://www.ncbi.nlm.nih.gov/taxonomy 21. Nickel, M., Rosasco, L., Poggio, T.A.: Holographic embeddings of knowledge graphs. CoRR abs/1510.04935 (2015), http://arxiv.org/abs/1510.04935 22. PubChem: National institutes of health (nih) (2019), https://pubchem.ncbi.nlm.nih.gov/ 23. U.S. EPA: Ecotoxicology knowledgebase (ecotox) (2019), https://cfpub.epa.gov/ecotox/ 24. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743 (2017) 25. Yang, B., tau Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575 (2015)