<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge Graph Embedding for Ecotoxicological Effect Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Erik B. Myklebust</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto Jimenez-Ruiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiaoyan Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raoul Wolf</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Knut Erik Tollefsen</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alan Turing Institute</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Oxford</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Norwegian Institute for Water Research (NIVA)</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Exploring the effects of a chemical compound on a species takes considerable experimental effort. Appropriate methods for estimating and suggesting new effects can dramatically reduce the work a laboratory needs to do. Here, we explore the suitability of a knowledge graph embedding approach for ecotoxicological effect prediction. A knowledge graph has been constructed from publicly available data sets, including a species taxonomy and chemical knowledge. These knowledge sources are integrated by ontology alignment techniques. Our experimental results show that the knowledge graph and its embeddings augment the baseline models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Determining the ecotoxicological effects a chemical compound has on a species takes
immense experimental effort. Such effect data are available only for a narrow range of
compound-species pairs and a limited number of experimental tests.</p>
      <p>
        Here, we present a preliminary study of the benefits of using Semantic Web tools to
integrate different data sources and knowledge graph (KG) approaches to improve the
ecotoxicological effect prediction over a baseline. Hence, our contribution is twofold:
(i) We have created a KG by gathering and integrating the relevant data from disparate
sources. In order to discover equivalent entities we exploit internal resources, external
resources (e.g., Wikidata [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) and ontology alignment (e.g., LogMap [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ,
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). (ii) We have evaluated three KG embedding approaches (TransE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], DistMult [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and HolE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) together with a baseline based on a one-hot encoding. Our evaluation shows
improvement in the metrics using KG embedding for a majority of the selected
classification models. Note that recall is preferred over precision, i.e., we would rather
overestimate the effect of a chemical compound than underestimate its hazardousness.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        Knowledge graphs. We follow the RDF-based notion of KGs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which are composed
of RDF triples <inline-formula><tex-math>\langle s, p, o \rangle</tex-math></inline-formula>, where s represents a subject (a class or an instance), p
represents a predicate (a property) and o represents an object (a class, an instance or a data
value, e.g., text, a date or a number).
      </p>
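      <p>As an illustration only, the triple notion above can be sketched with plain Python tuples. The identifiers with the ex: prefix and the helper match are hypothetical and chosen for this example; they are not part of TERA or of any RDF library.</p>
      <preformat><![CDATA[
```python
# A toy RDF-style knowledge graph held as a set of (subject, predicate, object) tuples.
# All ex: identifiers are made up for illustration; they are not taken from TERA.
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

kg: Set[Triple] = {
    ("ex:species1", "rdf:type", "ex:Daphnia"),          # instance -> class
    ("ex:Daphnia", "rdfs:subClassOf", "ex:Crustacea"),  # class -> class
    ("ex:species1", "rdfs:label", "Daphnia magna"),     # instance -> data value
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Set[Triple]:
    """Return all triples that agree with the pattern; None acts as a wildcard."""
    return {
        (ts, tp, to)
        for (ts, tp, to) in graph
        if s in (None, ts) and p in (None, tp) and o in (None, to)
    }


# Everything stated about ex:species1.
about_species1 = match(kg, s="ex:species1")
# All subclass statements in the graph.
subclass_axioms = match(kg, p="rdfs:subClassOf")
```
]]></preformat>
      <p>The three sample triples cover the three kinds of object named above: a class, another class reached through a hierarchy property, and a data value.</p>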
      <p>
        Ontology alignment. Ontology alignment is the process of finding mappings or
correspondences between a source and a target ontology or knowledge graph [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These
mappings are typically represented as equivalences among the entities of the input
resources (e.g., ncbi:DaphniaMagna owl:sameAs ecotox:daphniamagna).
      </p>
      <p>
        Embedding models. KG embedding [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] plays a key role in link prediction problems,
where the goal is to learn a scoring function <inline-formula><tex-math>S : \mathcal{E} \times \mathcal{R} \times \mathcal{E} \rightarrow \mathbb{R}</tex-math></inline-formula>, with <inline-formula><tex-math>\mathcal{E}</tex-math></inline-formula> the set of entities and <inline-formula><tex-math>\mathcal{R}</tex-math></inline-formula> the set of relations. The score <inline-formula><tex-math>S(s, p, o)</tex-math></inline-formula> is
proportional to the probability that a triple <inline-formula><tex-math>\langle s, p, o \rangle</tex-math></inline-formula> is encoded as true. Several models have been
proposed, e.g., the Translating Embeddings model (TransE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These models are applied
to KGs to resolve missing facts in largely connected KGs, such as DBpedia [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
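      <p>A minimal sketch of two such scoring functions follows, assuming the commonly used forms: TransE scored as the negative Euclidean distance between subject plus relation and object, and DistMult scored as a trilinear product. The vectors are hand-picked toy values rather than learned embeddings, and the names compound, affects, species_a and species_b are hypothetical.</p>
      <preformat><![CDATA[
```python
# Sketch of two KG-embedding scoring functions on plain Python lists.
# The vectors are hand-picked toy values, not learned embeddings.
import math
from typing import List

Vector = List[float]


def transe_score(e_s: Vector, r_p: Vector, e_o: Vector) -> float:
    """TransE (L2 variant): negative distance between (e_s + r_p) and e_o."""
    return -math.sqrt(sum((s + p - o) ** 2 for s, p, o in zip(e_s, r_p, e_o)))


def distmult_score(e_s: Vector, r_p: Vector, e_o: Vector) -> float:
    """DistMult: trilinear product, sum over i of e_s[i] * r_p[i] * e_o[i]."""
    return sum(s * p * o for s, p, o in zip(e_s, r_p, e_o))


# Toy 2-dimensional embeddings for a hypothetical relation and entities.
compound = [1.0, 0.0]
affects = [0.0, 1.0]
species_a = [1.0, 1.0]   # equals compound + affects, so its distance is zero
species_b = [-1.0, 0.0]  # far from compound + affects

score_a = transe_score(compound, affects, species_a)
score_b = transe_score(compound, affects, species_b)
```
]]></preformat>
      <p>A higher score marks a more plausible triple, so in this toy setting the triple ending in species_a would be ranked above the one ending in species_b. Note that the DistMult score is unchanged when subject and object are swapped, so it cannot by itself distinguish the direction of a relation.</p>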
    </sec>
    <sec id="sec-3">
      <title>The TERA knowledge graph</title>
      <p>
        We construct the Toxicology and Risk Assessment (TERA) KG from four sources:
(i) The effect data is gathered from ECOTOX [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We focus our effort on acute effects,
e.g., LC50 (lethal concentration for 50% of test species) and NR-ZERO (no effect on
all test species). This data is converted to a compound-species pair and a label (true or
false). (ii) The chemical hierarchy is created by combining RDF data available from
PubChem [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and querying the ChEMBL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] SPARQL endpoint. (iii) The species
hierarchy is gathered from the tabular data available in the NCBI Taxonomy [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. (iv) We
gather species habitat and endemic data from the Encyclopedia of Life (EOL) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We
align the four data sources using LogMap and the Wikidata SPARQL endpoint. Details
of the construction of the TERA knowledge graph are available in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
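      <p>The conversion of effect records into labelled compound-species pairs could look roughly like the sketch below. The record layout, the identifiers, and the rule that an LC50 record yields a true label while an NR-ZERO record yields a false label are assumptions made for this example; the text does not spell out the exact rule, nor how conflicting records for the same pair are resolved.</p>
      <preformat><![CDATA[
```python
# Sketch: turn ECOTOX-style test records into labelled (compound, species) pairs.
# Record layout, identifiers and the endpoint-to-label rule are assumptions.
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]

# Assumed mapping: an LC50 record counts as an observed effect,
# an NR-ZERO record counts as no observed effect.
ENDPOINT_LABEL = {"LC50": True, "NR-ZERO": False}


def label_pairs(records: Iterable[Tuple[str, str, str]]) -> Dict[Pair, bool]:
    """Map each (compound, species) pair to True if any record shows an effect."""
    labels: Dict[Pair, bool] = {}
    for compound, species, endpoint in records:
        if endpoint not in ENDPOINT_LABEL:
            continue  # skip endpoints outside the two considered here
        pair = (compound, species)
        # Conservative merge (an assumption): once an effect is seen, the pair stays True.
        labels[pair] = labels.get(pair, False) or ENDPOINT_LABEL[endpoint]
    return labels


records = [
    ("c1", "s1", "LC50"),
    ("c1", "s2", "NR-ZERO"),
    ("c2", "s2", "NR-ZERO"),
    ("c2", "s2", "LC50"),   # conflicting record: the effect wins
    ("c3", "s3", "EC50"),   # ignored in this sketch
]
labels = label_pairs(records)
```
]]></preformat>
      <p>Letting an observed effect override a no-effect record errs on the side of overestimating hazard; other merge rules are equally possible.</p>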
    </sec>
    <sec id="sec-4">
      <title>Effect prediction</title>
      <p>
        We learn different types of classification models, including Gaussian naive Bayes (NB),
quadratic discriminant analysis (QDA), radial basis function kernel support-vector
machine (SVM), and multilayer perceptron (MLP), to solve the problem described in
Figure 1. The input is a compound-species pair. It is encoded either as the concatenation
of the one-hot vectors of the compound and the species (baseline), or as the concatenation
of the embedding vectors learned by the embedding model (TransE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], DistMult [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] or
HolE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). These models were considered because they are intuitive, have shown
state-of-the-art performance (e.g., [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), and encode directional relationships, respectively. The
output is binary, Affects (1) or Not affects (0), indicating whether the compound affects the
species.
      </p>
      <p>Fig. 1: The effect prediction problem. Lowercase s<sub>j</sub> and c<sub>i</sub> are instances of species and
compounds, while uppercase letters denote classes in the hierarchy. Solid lines are observations and dashed
lines are to be predicted, i.e., does c<sub>2</sub> affect s<sub>1</sub>?</p>
      <p>Fig. 2 panels: (a) Accuracy, (b) Precision, (c) Recall, (d) AUC.</p>
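      <p>The two input encodings described in this section can be sketched as follows. The entity names and embedding values are placeholders chosen for illustration, and the helper functions are hypothetical; in practice the vectors would come from a trained embedding model.</p>
      <preformat><![CDATA[
```python
# Sketch: build classifier inputs for a (compound, species) pair in two ways.
# Entity names and embedding values are placeholders, not TERA data.
from typing import Dict, List

compounds = ["c1", "c2", "c3"]
species = ["s1", "s2", "s3"]


def one_hot(item: str, vocabulary: List[str]) -> List[float]:
    """Vector of zeros with a single 1.0 at the item's position in the vocabulary."""
    return [1.0 if item == v else 0.0 for v in vocabulary]


def encode_one_hot(c: str, s: str) -> List[float]:
    """Baseline: concatenate the one-hot vectors of compound and species."""
    return one_hot(c, compounds) + one_hot(s, species)


def encode_embedding(c: str, s: str, emb: Dict[str, List[float]]) -> List[float]:
    """KG variant: concatenate the embedding vectors of compound and species."""
    return emb[c] + emb[s]


# Placeholder 2-dimensional vectors standing in for learned embeddings.
embeddings = {
    "c2": [0.3, -0.1],
    "s1": [0.7, 0.2],
}

x_baseline = encode_one_hot("c2", "s1")
x_kg = encode_embedding("c2", "s1", embeddings)
```
]]></preformat>
      <p>The baseline vector grows with the number of compounds plus the number of species and is mostly zeros, whereas the embedding-based vector has a fixed length of twice the embedding dimension.</p>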
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>Results. Figure 2 shows the results of the different models using different encodings
of the input (compound-species pair). Two of the four models tested,
namely SVM and MLP, achieve higher performance with KG embedding than with
one-hot encoding. For the QDA model, KG embedding also yields higher recall than
one-hot encoding, although the overall metrics AUC and accuracy are similar. Note that
recall is more important than precision in ecotoxicological effect prediction. The only
exception is the NB model, where one-hot encoding performs much better than
KG embedding. This is because NB assumes that the input variables are
conditionally independent; hence, it works better on the one-hot encoding, which is quite
sparse. However, it is worth noting that NB with one-hot
encoding does not outperform the MLP and QDA models with KG embedding.</p>
      <p>Conclusion. We have created a KG called TERA that aims at covering the knowledge
and data relevant to the ecotoxicological domain. We have also implemented a
proof-of-concept prototype for ecotoxicological effect prediction based on knowledge graph
embeddings and classification models. Some of the models used can take advantage
of the learned embedded features. However, simple models like NB preferred the
one-hot encoded vectors. The obtained results are encouraging, showing the positive impact
of using KG embedding models and the benefits of having an integrated view of the
different knowledge and data sources.</p>
      <p>Future work. The main goal in the long-term future is to make the TERA-KG accessible
for domain researchers and improve the effect prediction by enriching the KG. In the
near future, we intend to improve the current ecotoxicological effect prediction
prototype and evaluate the suitability of more sophisticated models like Graph Convolutional
Networks.</p>
      <p>Resources. The datasets, evaluation results, documentation and source code are
available from the following GitHub repository: https://github.com/Erik-BM/NIVAUC</p>
      <p>Acknowledgements. This work is supported by grant 272414 from the Research
Council of Norway (RCN), the MixRisk project (RCN 268294), the AIDA project (The
Turing Institute) and the SIRIUS Centre for Scalable Data Access (RCN 237889).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arnaout</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elbassuoni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Effective Searching of RDF Knowledge Graphs</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>48</volume>
          (
          <issue>0</issue>
          ) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Translating Embeddings for Modeling Multi-relational Data</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pp.
          <fpage>2787</fpage>
          -
          <lpage>2795</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>Ontology Matching, Second Edition</source>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hastings</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>ChEBI in 2016: Improved services and an expanding collection of metabolites</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>44</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D1214</fpage>
          -
          <lpage>D1219</lpage>
          (January
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuenca Grau</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>LogMap: Logic-Based and Scalable Ontology Matching</article-title>
          .
          <source>In: 10th International Semantic Web Conference</source>
          . pp.
          <fpage>273</fpage>
          -
          <lpage>288</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuenca Grau</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Large-scale interactive ontology matching: Algorithms and implementation</article-title>
          .
          <source>In: ECAI Conference</source>
          . pp.
          <fpage>444</fpage>
          -
          <lpage>449</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kadlec</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bajgar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleindienst</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Knowledge base completion: Baselines strike back</article-title>
          .
          <source>CoRR abs/1705.10744</source>
          (
          <year>2017</year>
          ), http://arxiv.org/abs/1705.10744
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>PubChem 2019 update: improved access to chemical data</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>47</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D1102</fpage>
          -
          <lpage>D1109</lpage>
          (10
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>167</fpage>
          -
          <lpage>195</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Myklebust</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tollefsen</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          :
          <article-title>Enabling Semantic Data Access for Toxicological Risk Assessment</article-title>
          .
          <source>CoRR abs/1908.10128</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Myklebust</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tollefsen</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          :
          <article-title>Knowledge graph embedding for ecotoxicological effect prediction</article-title>
          .
          <source>In: Int'l Sem. Web Conf. (ISWC)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nickel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosasco</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggio</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          :
          <article-title>Holographic embeddings of knowledge graphs</article-title>
          .
          <source>CoRR abs/1510.04935</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1510.04935
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Parr</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          , et al.:
          <article-title>The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth</article-title>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sayers</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          , et al.:
          <article-title>Database resources of the National Center for Biotechnology Information</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>37</volume>
          (
          <issue>suppl 1</issue>
          ),
          <fpage>D5</fpage>
          -
          <lpage>D15</lpage>
          (10
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. U.S. EPA:
          <article-title>Ecotoxicology Knowledgebase (ECOTOX)</article-title>
          (
          <year>2019</year>
          ), https://cfpub.epa.gov/ecotox/
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Knowledge graph embedding: A survey of approaches and applications</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>29</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2724</fpage>
          -
          <lpage>2743</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yih</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Embedding entities and relations for learning and inference in knowledge bases</article-title>
          .
          <source>CoRR abs/1412.6575</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>