=Paper=
{{Paper
|id=Vol-2350/xposter2
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-2350/xposter2.pdf
|volume=Vol-2350
|dblpUrl=https://dblp.org/rec/conf/aaaiss/VlietstraVMK19
}}
==None==
'''Using Predicate Information from a Knowledge Graph to Identify Disease Trajectories'''

Vlietstra, W.J.<sup>1</sup>, Vos, R.<sup>1,2</sup>, van Mulligen, E.M.<sup>1</sup>, Kors, J.A.<sup>1</sup>

<sup>1</sup> Department of Medical Informatics, Erasmus Medical Centre, Rotterdam, 3015 GE, the Netherlands

<sup>2</sup> Department of Methodology & Statistics, Maastricht University, Maastricht, 6229 HA, the Netherlands

w.vlietstra@erasmusmc.nl

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

===Introduction===
Knowledge graphs can represent the contents of biomedical literature and databases as subject-predicate-object triples, where predicates describe relationships between pairs of biomedical entities. For example, the Reactome database contains the triple “GTF2H2-controls the expression of-MDC1”, and SemMedDB, which obtains its triples through text-mining, contains the triple “IL1B-stimulates-MCP1”. By integrating triples from different sources with each other in a knowledge graph, the comprehensive body of biomedical knowledge can be computationally analyzed.

Analyses performed on knowledge graphs often aim to identify new relationships, e.g., between drugs and diseases, genes and phenotypes, or between diseases. However, from large-scale observational studies we know that multiple diseases in patients are often diagnosed in specific temporal sequences, which are referred to as disease trajectories. Using knowledge graphs to identify disease trajectories therefore requires identifying both the correct pair of diseases and their correct temporal sequence.

Because protein networks are involved with metabolic, signaling, immune, and gene-regulatory networks, they are often used to mechanistically explain relationships between diseases. So-called disease proteins, which are proteins coded by genes associated with a disease, can be used to represent diseases on a protein level. However, until now predicates between proteins have rarely been used, even though they describe the relationships between (disease) proteins and can therefore provide additional information about the mechanism by which one disease can lead to another. We therefore aim to exploit the predicate information from paths between (disease) proteins in a knowledge graph to determine whether a sequence of two diseases forms a trajectory.

===Method===
The temporal disease trajectories as described by Jensen et al. were used as a reference set (Jensen 2014). They analyzed diagnoses in 6.2 million electronic patient records of the Danish population, assigned during 14.9 years, to identify common disease trajectories. From these trajectories, we only used those that describe a sequence of two diseases. A complementary, negative set of non-trajectories was constructed by creating random pairs of the diseases in the reference set, as well as the reversed (incorrect) temporal sequence of the trajectories in the reference set. Associations between proteins and diseases were obtained from the manually curated subset of DisGeNET (Piñero 2017).

Three scenarios of paths between the disease proteins of pairs of diseases were extracted from the knowledge graph:
# Overlap, where two diseases A and B share the same disease protein. Optionally, this disease protein has a relationship to itself, e.g., if it can homodimerize.
# Direct path, where there is a triple of which one of the disease proteins of disease A and one of the disease proteins of disease B form the subject and object.
# Indirect path, where one intermediate protein connects the disease proteins of disease A and disease B, requiring a sequence of two triples.

Based on the predicates within these paths, six feature sets were constructed. We compared two methods to represent indirect relationships between disease proteins. The first method constructs so-called metapaths (Himmelstein 2017), where the sequence of predicates in an indirect path is used as a single feature. The second method considers each predicate in the indirect paths as a separate feature (Vlietstra 2018).

For both methods we experimented with three variations of directional information of the predicates. Directional information was never used when the same protein was both subject and object of the triple (overlap scenario).
# Undirected: triples forming direct and indirect paths between disease proteins are used without information about which proteins are subject and object.
# Directed: each triple, in each direct and indirect path between the disease proteins, has a direction as indicated by its subject and object.
# Mixed: each predicate in the direct and indirect paths is classified as directed or undirected based on prior information.

===Results===
Our reference set contained 2,530 trajectories and 168,870 non-trajectories. We used random forests to train a classification model. The cross-validated performance is shown in Table 1, along with the number of features in the feature set.

{| class="wikitable"
|+ Table 1. Classification results for the six feature sets when trained with balanced training sets. The values in the AUC columns indicate the mean area under the ROC curve over 10 repeats of a 10-fold cross-validation experiment, with the standard deviation in parentheses.
! rowspan="2" | Directional information
! colspan="2" | Metapaths
! colspan="2" | Split paths
|-
! Number of features !! AUC !! Number of features !! AUC
|-
| Undirected || 1217 || 81.3% (1.4%) || 168 || 76.0% (1.5%)
|-
| Mixed || 2823 || 87.9% (0.9%) || 257 || 84.0% (1.1%)
|-
| Directed || 3773 || 88.1% (0.9%) || 277 || 81.7% (1.5%)
|}

Use of directional information of predicates substantially improved performance as compared to not using this information. However, disease trajectories could still be identified with reasonable performance if only undirected information was used.

The metapath feature sets consisted of 7 to 14 times more features than the split-path feature sets, and achieved a superior performance as compared to the split-path features. The performance difference between the mixed and the directed metapath features was negligible. The performance of split-path features increased if prior knowledge about directed or undirected predicates was taken into account.

===Discussion===
Our work demonstrates that disease trajectories can be identified using predicate information from a protein knowledge graph. Our machine-learning-based classifier is capable of identifying both the correct pairs of diseases and their correct temporal sequence. While the use of directional information of triples in our analysis improved performance, even when no directional information is used our classifier can identify directed relationships with reasonable performance. The use of prior knowledge to classify predicates as directed or undirected improves performance on split-path feature sets, but has no impact with metapath feature sets. Metapaths result in many more features than the split paths, and consistently achieve a superior performance.

As future work we intend to perform a detailed error analysis, where we will investigate whether there are specific diseases whose trajectories are frequently misclassified. The International Classification of Diseases (ICD) hierarchy can be used to abstract diseases to a higher ICD level, thereby obtaining insight into misclassifications at the level of disease classes. Abstracting the diseases in the trajectories also makes it possible to examine whether specific combinations of ICD classes are more frequently misclassified.

===References===
* Himmelstein, D.S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S.L., Hadley, D., Green, A., Khankhanian, P., Baranzini, S.E. 2017. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. In eLife, 6:1-35
* Jensen, A.B., Moseley, P.L., Oprea, T.I., Ellesøe, S.G., Eriksson, R., Schmock, H., Jensen, P.B., Jensen, L.J., Brunak, S. 2014. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. In Nature Communications, 5:4022
* Piñero, J., Bravo, À., Queralt-Rosinach, N., Gutiérrez-Sacristán, A., Deu-Pons, J., Centeno, E., García-García, J., Sanz, F., Furlong, L.I. 2017. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. In Nucleic Acids Research, 45:833-839
* Vlietstra, W.J., Vos, R., Sijbers, A.M., van Mulligen, E.M., Kors, J.A. 2018. Using predicate and provenance information from a knowledge graph for drug efficacy screening. In Journal of Biomedical Semantics, 9:23
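To illustrate the difference between metapath and split-path features for direct and indirect paths, the following is a minimal sketch and not the authors' implementation. The toy triples, the protein names, the function names <code>steps_between</code> and <code>path_features</code>, and the <code>&gt;</code> / <code>&lt;</code> direction markers are all assumptions made for illustration. The overlap scenario is left out, and the three directional variants are modelled by choosing which predicates keep their direction: none for undirected, all for directed, a subset for mixed.

```python
from collections import Counter

# Toy triples (subject, predicate, object); all names are hypothetical.
TRIPLES = [
    ("P1", "stimulates", "P2"),
    ("P2", "inhibits", "P3"),
    ("P1", "binds", "P3"),
]


def steps_between(triples, source, target, directed_preds):
    """Return labels of single triples that link source and target.

    Predicates in directed_preds keep a direction marker ('>' when the
    triple points from source to target, '<' when it points the other way).
    """
    labels = []
    for subj, pred, obj in triples:
        keep_direction = pred in directed_preds
        if (subj, obj) == (source, target):
            labels.append(pred + ">" if keep_direction else pred)
        elif (obj, subj) == (source, target):
            labels.append("<" + pred if keep_direction else pred)
    return labels


def path_features(triples, proteins_a, proteins_b, directed_preds, metapaths):
    """Count predicate features on direct and indirect paths from A to B."""
    proteins = {p for subj, _, obj in triples for p in (subj, obj)}
    features = Counter()
    for a in proteins_a:
        for b in proteins_b:
            if a == b:
                continue  # overlap scenario is not covered in this sketch
            # Direct path: one triple connects the two disease proteins.
            for label in steps_between(triples, a, b, directed_preds):
                features[label] += 1
            # Indirect path: two triples via one intermediate protein.
            for mid in proteins - {a, b}:
                for first in steps_between(triples, a, mid, directed_preds):
                    for second in steps_between(triples, mid, b, directed_preds):
                        if metapaths:
                            # Metapath: the predicate sequence is one feature.
                            features[first + "|" + second] += 1
                        else:
                            # Split path: each predicate is its own feature.
                            features[first] += 1
                            features[second] += 1
    return features


ALL_PREDS = {"stimulates", "inhibits", "binds"}
directed_meta = path_features(TRIPLES, {"P1"}, {"P3"}, ALL_PREDS, metapaths=True)
undirected_split = path_features(TRIPLES, {"P1"}, {"P3"}, set(), metapaths=False)
print(dict(directed_meta))
print(dict(undirected_split))
```

In a setup like this, the feature counts for each disease pair would form one row of the matrix passed to a classifier such as a random forest. Swapping the two protein sets in the directed variant yields different labels (for example <code>&lt;inhibits|&lt;stimulates</code>), which is one way directional information could help separate a trajectory from its reversed sequence.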