<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Predicate Information from a Knowledge Graph to Identify Disease Trajectories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vlietstra</string-name>
          <email>w.vlietstra@erasmusmc.nl</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>van Mulligen</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Medical Informatics, Erasmus Medical Centre</institution>
          ,
          <addr-line>Rotterdam, 3015 GE</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Methodology &amp; Statistics, Maastricht University</institution>
          ,
          <addr-line>Maastricht, 6229 HA</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>25</fpage>
      <lpage>27</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Knowledge graphs can represent the contents of biomedical
literature and databases as subject-predicate-object triples,
where predicates describe relationships between pairs of
biomedical entities. For example, the Reactome database
contains the triple “GTF2H2-controls the expression
of-MDC1”, and SemMedDB, which obtains its triples through
text mining, contains the triple “IL1B-stimulates-MCP1”.
By integrating triples from different sources
in a single knowledge graph, the comprehensive body of
biomedical knowledge can be analyzed computationally.</p>
      <p>Analyses performed on knowledge graphs often aim to
identify new relationships, e.g., between drugs and diseases,
genes and phenotypes, or pairs of diseases. Large-scale
observational studies have shown, however, that multiple
diseases in a patient are often diagnosed in specific temporal
sequences, which are referred to as disease trajectories. Using
knowledge graphs to identify disease trajectories therefore
requires identifying both the correct pair of diseases and
their correct temporal sequence.</p>
      <p>Because protein networks are involved in metabolic,
signaling, immune, and gene-regulatory networks, they are
often used to mechanistically explain relationships between
diseases. So-called disease proteins, which are proteins
coded by genes associated with a disease, can be used to
represent diseases on a protein level. Until now, however,
predicates between proteins have rarely been used, even
though they describe the relationships between
(disease) proteins and can therefore provide additional information about the
mechanism by which one disease can lead to another. We
therefore aim to exploit the predicate information from paths
between (disease) proteins in a knowledge graph to
determine whether a sequence of two diseases forms a trajectory.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>
        The temporal disease trajectories described by Jensen et al.
        <xref ref-type="bibr" rid="ref2">(Jensen 2014)</xref>
        were used as a reference set. They
analyzed diagnoses in 6.2 million electronic patient records of
the Danish population, assigned over a period of 14.9 years, to
identify common disease trajectories. From these trajectories,
we used only those that describe a sequence of two diseases.
A complementary, negative set of non-trajectories was
constructed from random pairs of the diseases in the
reference set, together with the reversed (incorrect) temporal
sequences of the trajectories in the reference set. Associations
between proteins and diseases were obtained from the
manually curated subset of DisGeNET
        <xref ref-type="bibr" rid="ref3">(Piñero 2017)</xref>.
      </p>
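      <p>The construction of the negative set can be illustrated with a minimal sketch. It assumes that random pairs are ordered pairs of distinct diseases and that they exclude both the reference trajectories and their reversals; the function name, the number of sampled pairs, the seed, and the disease identifiers are placeholders that do not come from the study.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import itertools
import random


def build_negative_set(trajectories, n_random, seed=0):
    """Return (reversed_pairs, random_pairs) for a set of ordered disease pairs.

    `trajectories` holds (first_disease, second_disease) tuples in temporal order.
    """
    positives = set(trajectories)
    diseases = sorted({d for pair in positives for d in pair})

    # Reversed trajectories: the right pair of diseases in the wrong order.
    reversed_pairs = {(b, a) for (a, b) in positives if (b, a) not in positives}

    # Random pairs (assumption): ordered pairs of distinct diseases that are
    # neither a reference trajectory nor one of the reversed trajectories.
    candidates = [
        pair for pair in itertools.permutations(diseases, 2)
        if pair not in positives and pair not in reversed_pairs
    ]
    rng = random.Random(seed)
    random_pairs = set(rng.sample(candidates, min(n_random, len(candidates))))
    return reversed_pairs, random_pairs


# Placeholder disease identifiers, not taken from the reference set.
reference = {("D1", "D2"), ("D3", "D4")}
reversed_pairs, random_pairs = build_negative_set(reference, n_random=3)
print(sorted(reversed_pairs))  # [('D2', 'D1'), ('D4', 'D3')]
print(len(random_pairs))       # 3
```
]]></preformat>
      <p>Keeping the reversed pairs separate from the random pairs makes it possible to evaluate the two tasks, finding the correct pair and finding the correct order, independently.</p>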
      <p>Three scenarios of paths between the disease proteins of
pairs of diseases were extracted from the knowledge graph:
        <list list-type="order">
          <list-item>
            <p>Overlap, where two diseases A and B share the same disease protein. Optionally, this disease protein has a relationship to itself, e.g., if it can homodimerize.</p>
          </list-item>
          <list-item>
            <p>Direct path, where a single triple has one of the disease proteins of disease A and one of the disease proteins of disease B as its subject and object.</p>
          </list-item>
          <list-item>
            <p>Indirect path, where one intermediate protein connects the disease proteins of disease A and disease B, requiring a sequence of two triples.</p>
          </list-item>
        </list>
      </p>
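      <p>The three scenarios can be sketched as a small path search over a list of triples. This is an illustration under stated assumptions rather than the extraction procedure used in the study: the function name, the protein identifiers P1 to P5, and the predicates are hypothetical, triples are matched regardless of which protein is the subject, and the intermediate protein is only required to differ from the two end points.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def find_paths(triples, proteins_a, proteins_b):
    """Collect overlap, direct and indirect paths between two protein sets.

    `triples` is an iterable of (subject, predicate, object) tuples.
    """
    triples = list(triples)
    adjacent = defaultdict(list)  # protein -> triples it takes part in
    for s, p, o in triples:
        adjacent[s].append((s, p, o))
        if o != s:
            adjacent[o].append((s, p, o))

    paths = {"overlap": [], "direct": [], "indirect": []}

    # 1) Overlap: a shared disease protein, with optional self-relationships.
    for protein in sorted(proteins_a & proteins_b):
        self_triples = [t for t in adjacent[protein] if t[0] == t[2]]
        paths["overlap"].append((protein, self_triples))

    # 2) Direct path: one triple links a protein of A with a protein of B.
    for s, p, o in triples:
        if s == o:
            continue
        if (s in proteins_a and o in proteins_b) or (s in proteins_b and o in proteins_a):
            paths["direct"].append([(s, p, o)])

    # 3) Indirect path: two triples joined by one intermediate protein.
    for a in sorted(proteins_a):
        for first in adjacent[a]:
            mid = first[2] if first[0] == a else first[0]
            if mid == a:
                continue  # self-relationship, no intermediate protein
            for second in adjacent[mid]:
                b = second[2] if second[0] == mid else second[0]
                if b in proteins_b and b not in (a, mid):
                    paths["indirect"].append([first, second])
    return paths


# Hypothetical triples and disease-protein sets.
triples = [
    ("P1", "binds", "P1"),
    ("P2", "stimulates", "P3"),
    ("P2", "stimulates", "P4"),
    ("P5", "inhibits", "P4"),
]
paths = find_paths(triples, proteins_a={"P1", "P2"}, proteins_b={"P1", "P3", "P5"})
for scenario, found in paths.items():
    print(scenario, found)
```
]]></preformat>
      <p>In this toy graph, P1 is an overlap protein with a self-relationship, P2 and P3 are linked by a direct path, and P2 and P5 are linked by an indirect path through P4.</p>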
      <p>
        Based on the predicates within these paths, six feature sets
were constructed by combining two methods of representing
indirect relationships between disease proteins with three
variations of directional information. The first
method constructs so-called metapaths
        <xref ref-type="bibr" rid="ref1">(Himmelstein 2017)</xref>,
where the sequence of predicates in an indirect path is used
as a single feature. The second method, referred to here as
split paths, considers each
predicate in the indirect paths as a separate feature
        <xref ref-type="bibr" rid="ref4">(Vlietstra 2018)</xref>.
      </p>
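      <p>The difference between the two representations can be made concrete with a short sketch. It assumes that features are occurrence counts (the text does not state whether counts or binary indicators were used); the function names, the separator, and the example paths are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def metapath_features(paths):
    """One feature per path: the full sequence of predicates along the path."""
    return Counter("|".join(predicate for _, predicate, _ in path) for path in paths)


def split_features(paths):
    """One feature per predicate: every predicate in every path counts separately."""
    return Counter(predicate for path in paths for _, predicate, _ in path)


# Two hypothetical indirect paths, each a sequence of two triples.
indirect_paths = [
    [("P2", "stimulates", "P4"), ("P5", "inhibits", "P4")],
    [("P2", "stimulates", "P6"), ("P6", "stimulates", "P5")],
]
metapaths = metapath_features(indirect_paths)
splits = split_features(indirect_paths)
print(dict(metapaths))  # {'stimulates|inhibits': 1, 'stimulates|stimulates': 1}
print(dict(splits))     # {'stimulates': 3, 'inhibits': 1}
```
]]></preformat>
      <p>With k distinct predicates, two-step metapaths can yield up to k × k distinct features, whereas split paths yield at most k, which is why metapath feature sets grow much larger.</p>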
      <p>For both methods we experimented with three
variations of directional information of the predicates.
Directional information was never used when the same protein
was both subject and object of the triple (overlap
scenario).
        <list list-type="order">
          <list-item>
            <p>Undirected: triples forming direct and indirect paths between disease proteins are used without information about which proteins are subject and object.</p>
          </list-item>
          <list-item>
            <p>Directed: each triple in each direct and indirect path between the disease proteins has a direction, as indicated by its subject and object.</p>
          </list-item>
          <list-item>
            <p>Mixed: each predicate in the direct and indirect paths is classified as directed or undirected based on prior information.</p>
          </list-item>
        </list>
      </p>
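      <p>One possible encoding of the three variations is sketched below. The arrow suffixes, the function names, and the set of predicates treated as directed are assumptions made for illustration; the text does not specify how direction was encoded or which predicates were classified as directed.</p>
      <preformat preformat-type="code"><![CDATA[
```python
MODES = ("undirected", "directed", "mixed")


def encode_step(triple, start, mode, directed_predicates=frozenset()):
    """Encode one triple that is traversed starting from protein `start`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    subject, predicate, obj = triple
    # No direction for undirected mode or for self-relationships (overlap).
    if mode == "undirected" or subject == obj:
        return predicate
    # Mixed mode: only predicates known to be directed keep their direction.
    if mode == "mixed" and predicate not in directed_predicates:
        return predicate
    return predicate + ("->" if subject == start else "<-")


def encode_path(path, start, mode, directed_predicates=frozenset()):
    """Encode every triple of a path walked from `start` to the other end."""
    tokens, current = [], start
    for triple in path:
        tokens.append(encode_step(triple, current, mode, directed_predicates))
        subject, _, obj = triple
        current = obj if subject == current else subject
    return tokens


# Hypothetical indirect path from P2 (disease A) via P4 to P5 (disease B).
path = [("P2", "stimulates", "P4"), ("P5", "binds", "P4")]
undirected = encode_path(path, "P2", "undirected")
directed = encode_path(path, "P2", "directed")
mixed = encode_path(path, "P2", "mixed", directed_predicates={"stimulates"})
print(undirected)  # ['stimulates', 'binds']
print(directed)    # ['stimulates->', 'binds<-']
print(mixed)       # ['stimulates->', 'binds']
```
]]></preformat>
      <p>The resulting tokens can be joined into a metapath feature or used individually as split-path features, which gives the six feature sets.</p>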
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Our reference set contained 2,530 trajectories and 168,870
non-trajectories. We used random forests to train a
classification model. The cross-validated performance is shown in
Table 1, along with the number of features in the feature set.
Use of directional information of predicates substantially
improved performance as compared to not using this
information. However, disease trajectories could still be
identified with reasonable performance if only undirected
information was used.</p>
      <p>The metapath feature sets consisted of 7 to 14 times more
features than the split-path feature sets, and outperformed
the split-path feature sets.
The performance difference between the mixed and the
directed metapath features was negligible. The performance
of split-path features increased if prior knowledge about directed
or undirected predicates was taken into account.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Our work demonstrates that disease trajectories can be
identified using predicate information from a protein knowledge
graph. Our machine-learning-based classifier is capable of
identifying both the correct pairs of diseases and their
correct temporal sequence. While the use of directional
information of triples improved performance, our classifier
can identify directed relationships with reasonable
performance even when no directional information is used.
The use of prior knowledge to classify
predicates as directed or undirected improves performance on
split-path feature sets, but has a negligible effect on metapath
feature sets. Metapaths result in many more features than
split paths, and consistently achieve superior performance.</p>
      <p>As future work we intend to perform a detailed error
analysis, in which we will investigate whether there are specific
diseases whose trajectories are frequently misclassified. The
International Classification of Diseases (ICD) hierarchy can
be used to abstract diseases to a higher ICD level, thereby
providing insight into misclassifications at the level of
disease classes. Abstracting the diseases in the trajectories also
allows us to examine whether specific combinations of ICD
classes are more frequently misclassified.</p>
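      <p>A minimal sketch of such an error analysis is given below. It assumes ICD-10 style codes and abstracts each code to its three-character category by truncation; other levels of the hierarchy (blocks or chapters) would require a lookup table. The function names, codes, labels, and predictions are placeholders, not results.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def icd_category(code):
    """Abstract an ICD-10 style code to its three-character category (E11.9 to E11)."""
    return code.strip().upper()[:3]


def misclassification_rates(records):
    """Error rate per (first category, second category) combination.

    `records` holds (first_code, second_code, true_label, predicted_label) tuples.
    """
    totals, errors = Counter(), Counter()
    for first, second, truth, predicted in records:
        key = (icd_category(first), icd_category(second))
        totals[key] += 1
        if truth != predicted:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Placeholder records: codes, labels and predictions are made up.
records = [
    ("E11.9", "N18.3", 1, 0),
    ("E11.2", "N18.9", 1, 1),
    ("I10", "I50.1", 1, 1),
]
rates = misclassification_rates(records)
print(rates)  # {('E11', 'N18'): 0.5, ('I10', 'I50'): 0.0}
```
]]></preformat>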
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Himmelstein</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lizee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hessler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brueggeman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khankhanian</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baranzini</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Systematic integration of biomedical knowledge prioritizes drugs for repurposing</article-title>
          .
          <source>In eLife</source>
          ,
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moseley</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oprea</surname>
            ,
            <given-names>T.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ellesøe</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eriksson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmock</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>P.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients</article-title>
          .
          <source>In Nature Communications</source>
          ,
          <volume>5</volume>
          :
          <fpage>4022</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Piñero</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bravo</surname>
            ,
            <given-names>À.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Queralt-Rosinach</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutiérrez-Sacristán</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deu-Pons</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Centeno</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-García</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlong</surname>
            ,
            <given-names>L.I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants</article-title>
          .
          <source>In Nucleic Acids Research</source>
          ,
          <volume>45</volume>
          :
          <fpage>833</fpage>
          -
          <lpage>839</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Vlietstra</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sijbers</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Mulligen</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kors</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Using predicate and provenance information from a knowledge graph for drug efficacy screening</article-title>
          .
          <source>In Journal of Biomedical Semantics</source>
          ,
          <volume>9</volume>
          :
          <fpage>23</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>