=Paper= {{Paper |id=Vol-2350/xposter2 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-2350/xposter2.pdf |volume=Vol-2350 |dblpUrl=https://dblp.org/rec/conf/aaaiss/VlietstraVMK19 }} ==None== https://ceur-ws.org/Vol-2350/xposter2.pdf
                   Using Predicate Information from a Knowledge Graph
                              to Identify Disease Trajectories
                            Vlietstra, W.J.1, Vos, R.1,2, van Mulligen, E.M.1, Kors, J.A.1
                       1 Department of Medical Informatics, Erasmus Medical Centre, Rotterdam, 3015 GE, the Netherlands
                       2 Department of Methodology & Statistics, Maastricht University, Maastricht, 6229 HA, the Netherlands
                                                          w.vlietstra@erasmusmc.nl




                           Introduction
Knowledge graphs can represent the contents of biomedical literature and databases as subject-predicate-object triples, where predicates describe relationships between pairs of biomedical entities. For example, the Reactome database contains the triple “GTF2H2-controls the expression of-MDC1”, and SemMedDB, which obtains its triples through text-mining, contains the triple “IL1B-stimulates-MCP1”. By integrating triples from different sources in a knowledge graph, the comprehensive body of biomedical knowledge can be computationally analyzed.
   Analyses performed on knowledge graphs often aim to identify new relationships, e.g., between drugs and diseases, genes and phenotypes, or between diseases. However, from large-scale observational studies we know that multiple diseases in patients are often diagnosed in specific temporal sequences, which are referred to as disease trajectories. Using knowledge graphs to identify disease trajectories therefore requires identifying both the correct pair of diseases and their correct temporal sequence.
   Because protein networks are involved in metabolic, signaling, immune, and gene-regulatory networks, they are often used to mechanistically explain relationships between diseases. So-called disease proteins, which are proteins coded by genes associated with a disease, can be used to represent diseases on a protein level. However, until now predicates between proteins have rarely been used, even though they describe the relationships between (disease) proteins and can therefore provide additional information about the mechanism by which one disease can lead to another. We therefore aim to exploit the predicate information from paths between (disease) proteins in a knowledge graph to determine whether a sequence of two diseases forms a trajectory.

                           Method
The temporal disease trajectories described by Jensen et al. were used as a reference set (Jensen 2014). They analyzed diagnoses in 6.2 million electronic patient records of the Danish population, assigned during 14.9 years, to identify common disease trajectories. From these trajectories, we only used those that describe a sequence of two diseases. A complementary, negative set of non-trajectories was constructed by creating random pairs of the diseases in the reference set, as well as by reversing the temporal sequence of the trajectories in the reference set (which yields the incorrect sequence). Associations between proteins and diseases were obtained from the manually curated subset of DisGeNET (Piñero 2017).
   Three scenarios of paths between the disease proteins of pairs of diseases were extracted from the knowledge graph:
   1) Overlap, where two diseases A and B share the same disease protein. Optionally, this disease protein has a relationship to itself, e.g., if it can homodimerize.
   2) Direct path, where there is a triple of which one of the disease proteins of disease A and one of the disease proteins of disease B form the subject and object.
   3) Indirect path, where one intermediate protein connects the disease proteins of disease A and disease B, requiring a sequence of two triples.
   Based on the predicates within these paths, six feature sets were constructed. We compared two methods to represent indirect relationships between disease proteins. The first method constructs so-called metapaths (Himmelstein 2017), where the sequence of predicates in an indirect path is used as a single feature. The second method considers each predicate in the indirect paths as a separate feature (Vlietstra 2018).
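
The three path scenarios and the two ways of turning indirect paths into features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the triples, protein names (P1 to P4), function names, and the `|` separator are hypothetical, the direction of triples is ignored for brevity, and no restriction is placed on which protein may act as the intermediate.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples (subject, predicate, object); names are illustrative only.
TRIPLES = [
    ("P1", "stimulates", "P2"),
    ("P2", "inhibits", "P3"),
    ("P1", "binds", "P3"),
    ("P4", "binds", "P4"),  # self-relationship, e.g. homodimerization
]


def overlap_paths(proteins_a, proteins_b, triples):
    """Scenario 1: shared disease proteins, with any predicates they have to themselves."""
    shared = proteins_a & proteins_b
    return {p: [pred for s, pred, o in triples if s == o == p] for p in sorted(shared)}


def direct_paths(proteins_a, proteins_b, triples):
    """Scenario 2: a single triple links a protein of disease A and one of disease B."""
    return [
        (s, pred, o)
        for s, pred, o in triples
        if s != o
        and ((s in proteins_a and o in proteins_b) or (s in proteins_b and o in proteins_a))
    ]


def indirect_paths(proteins_a, proteins_b, triples):
    """Scenario 3: two triples joined by one intermediate protein (direction ignored)."""
    adjacent = defaultdict(list)
    for s, pred, o in triples:
        if s != o:  # self-relationships cannot connect two different proteins
            adjacent[s].append((pred, o))
            adjacent[o].append((pred, s))
    paths = []
    for a in sorted(proteins_a):
        for pred1, mid in adjacent.get(a, []):
            for pred2, b in adjacent.get(mid, []):
                if b in proteins_b and b != a:
                    paths.append((a, pred1, mid, pred2, b))
    return paths


def metapath_features(paths):
    """Metapaths: the sequence of two predicates in a path is one feature."""
    return Counter(f"{pred1}|{pred2}" for _, pred1, _, pred2, _ in paths)


def split_path_features(paths):
    """Split paths: every predicate in a path is a separate feature."""
    return Counter(pred for _, pred1, _, pred2, _ in paths for pred in (pred1, pred2))


disease_a = {"P1", "P4"}  # hypothetical disease proteins of disease A
disease_b = {"P3", "P4"}  # hypothetical disease proteins of disease B
paths = indirect_paths(disease_a, disease_b, TRIPLES)
print(overlap_paths(disease_a, disease_b, TRIPLES))
print(direct_paths(disease_a, disease_b, TRIPLES))
print(metapath_features(paths), split_path_features(paths))
```

Because a metapath feature is a combination of predicates, the number of possible metapath features grows with the square of the number of predicates, whereas the split-path representation can never have more features than there are predicates.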

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Table 1. Classification results for the six feature sets when trained with balanced training sets. The AUC columns give the mean area under the ROC curve over 10 repeats of a 10-fold cross-validation experiment, with the standard deviation in parentheses.

                  Metapaths                             Split paths
                  Number of features   AUC              Number of features   AUC
Undirected        1217                 81.3% (1.4%)     168                  76.0% (1.5%)
Mixed             2823                 87.9% (0.9%)     257                  84.0% (1.1%)
Directed          3773                 88.1% (0.9%)     277                  81.7% (1.5%)
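
The evaluation protocol named in the caption of Table 1 can be sketched in plain Python as follows. This is an illustration under several assumptions, not the authors' code: the source does not state how the training sets were balanced (undersampling of the majority class is assumed here), whether the folds were stratified, or whether the standard deviation was taken over repeats or over folds (over repeats is assumed). All function names are hypothetical, and `fit_stub` is a trivial stand-in for a real classifier.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Distribute the indices of each class evenly over k folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def balance(train_idx, labels, rng):
    """Assumed balancing step: undersample so both classes are equally frequent."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)


def repeated_cv_auc(X, y, fit, k=10, repeats=10, seed=0):
    """Return mean and standard deviation of the per-repeat mean AUC."""
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(repeats):
        fold_aucs = []
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = balance([i for i in range(len(y)) if i not in held_out], y, rng)
            score = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            fold_aucs.append(
                roc_auc([y[i] for i in test_idx], [score(X[i]) for i in test_idx])
            )
        per_repeat.append(statistics.mean(fold_aucs))
    return statistics.mean(per_repeat), statistics.stdev(per_repeat)


def fit_stub(X_train, y_train):
    """Hypothetical stand-in classifier: scores a sample by its first feature,
    oriented so that the class with the higher training mean scores higher."""
    pos_mean = statistics.mean(x[0] for x, y in zip(X_train, y_train) if y == 1)
    neg_mean = statistics.mean(x[0] for x, y in zip(X_train, y_train) if y == 0)
    sign = 1.0 if pos_mean >= neg_mean else -1.0
    return lambda x: sign * x[0]


# Toy, imbalanced data: 20 positives and 200 negatives with one feature each.
X = [[10.0 + i % 3] for i in range(20)] + [[float(i % 5)] for i in range(200)]
y = [1] * 20 + [0] * 200
mean_auc, sd_auc = repeated_cv_auc(X, y, fit_stub)
print(f"AUC {mean_auc:.3f} ({sd_auc:.3f})")
```

Balancing is applied to the training indices only, so each held-out fold keeps the original class ratio; since AUC is a rank statistic, it remains interpretable on such imbalanced test folds.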

   For both methods we experimented with three variations of directional information of the predicates. Directional information was never used when the same protein was both subject and object of the triple (overlap scenario).
   1) Undirected: Triples forming direct and indirect paths between disease proteins are used without information about which proteins are subject and object.
   2) Directed: Each triple, in each direct and indirect path between the disease proteins, has a direction as indicated by its subject and object.
   3) Mixed: Each predicate in the direct and indirect paths is classified as directed or undirected based on prior information.

                           Results
Our reference set contained 2,530 trajectories and 168,870 non-trajectories. We used random forests to train a classification model. The cross-validated performance is shown in Table 1, along with the number of features in each feature set. Use of directional information of predicates substantially improved performance compared to not using this information (for metapaths, an AUC of 88.1% for the directed versus 81.3% for the undirected feature set). However, disease trajectories could still be identified with reasonable performance if only undirected information was used.
   The metapath feature sets consisted of 7 to 14 times more features than the split-path feature sets, and outperformed them in all three variations. The performance difference between the mixed and the directed metapath features was negligible. The performance of the split-path features increased if prior knowledge about directed or undirected predicates was taken into account.

                           Discussion
Our work demonstrates that disease trajectories can be identified using predicate information from a protein knowledge graph. Our machine learning based classifier is capable of identifying both the correct pairs of diseases and their correct temporal sequence. While the use of directional information of triples in our analysis improved performance, even when no directional information is used our classifier can identify directed relationships with reasonable performance. The use of prior knowledge to classify predicates as directed or undirected improves performance on split-path feature sets, but has no impact on metapath feature sets. Metapaths result in many more features than the split paths, and consistently achieve a superior performance.
   As future work we intend to perform a detailed error analysis, where we will investigate whether there are specific diseases whose trajectories are frequently misclassified. The International Classification of Diseases (ICD) hierarchy can be used to abstract diseases to a higher ICD level, thereby obtaining insight into misclassifications at the level of disease classes. Abstracting the diseases in the trajectories also allows us to examine whether specific combinations of ICD classes are more frequently misclassified.

                           References
Himmelstein, D.S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S.L., Hadley, D., Green, A., Khankhanian, P., Baranzini, S.E. 2017. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. In eLife, 6:1-35
Jensen, A.B., Moseley, P.L., Oprea, T.I., Ellesøe, S.G., Eriksson, R., Schmock, H., Jensen, P.B., Jensen, L.J., Brunak, S. 2014. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. In Nature Communications, 5:4022
Piñero, J., Bravo, À., Queralt-Rosinach, N., Gutiérrez-Sacristán, A., Deu-Pons, J., Centeno, E., García-García, J., Sanz, F., Furlong, L.I. 2017. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. In Nucleic Acids Research, 45:833-839
Vlietstra, W.J., Vos, R., Sijbers, A.M., van Mulligen, E.M., Kors, J.A. 2018. Using predicate and provenance information from a knowledge graph for drug efficacy screening. In Journal of Biomedical Semantics, 9:23