Graph-KD: Exploring Relational Information for Knowledge Discovery Roland Roller1 , Gaurav Vashisth1 , Philippe Thomas1 , He Wang1 , Michael Mikhailov1 and Mark Stevenson2 1 DFKI, Berlin, Germany 2 University of Sheffield, Sheffield, England Abstract. This paper presents Graph-KD, a tool to navigate through large rela- tional knowledge sources. Graph-KD provides methods to understand relation- ships between concepts using open discovery, closed discovery and knowledge inference. The purpose of the tool is the support of biomedical knowledge dis- covery and exploration. It is primarily intended to be used by medical researchers and presents a use case involving millions of relations from UMLS. Graph-KD is able to process even large graphs efficiently and can be accessed via a web- interface (http://biomedical.dfki.de/graph-kd). 1 Introduction Relational knowledge bases and ontologies are rich sources of concepts and the rela- tionships between them which often consist of large amounts of information. These resources generally include information about directly related concepts but the infor- mation about those related indirectly can also be extremely valuable. Exploring this information can provide further insights and can help to discover new knowledge. Various tools exist to explore knowledge graphs such as UMLS. However, exist- ing tools either have a different focus or/and cover only parts of the functionalities of Graph-KD. For instance, UMLS:Similarity [4] is a tool to measure semantic similarity and relatedness based on UMLS and provides a shortest path functionality. This cov- ers only one aspect of Graph-KD and is only accessible via API. In k-neighborhood decentralization [8] methods for large scale knowledge discovery in context of UMLS are presented, which is related to the functionalities we provide, such as shortest path. Cantor et al. [2] offer a method to explore relationships between UMLS and the Gene Ontology. Using statistical and semantic relationships, it is possible to infer relation- ships between diseases and gene products. Gómez-Romero et al.[3] developed a Big Data graph processing and visualization pipeline in order to decrease processing time of large graphs. However, even if various tools exist to explore relational information of UMLS none of them provides easy-to-use functionalities techniques to explore and understand how long-range information are connected with each other. 2 Graph-KD Graph-KD provides functionalities to explore a large knowledge graph using open and closed discovery as well as knowledge inference. Open and closed discovery base on the Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons Li- cense Attribution 4.0 International (CC BY 4.0). 2 Roller et al. idea of literature-based discovery (LBD) which was introduced by Swanson [6]. The au- thor found that fish oil may have beneficial effects in patients with Raynaud syndrome, a fact which was not known beforehand. Swanson noticed that, although no connection between fish oil and Raynaud syndrome was known, they shared a number of common connections. It was known that fish oil lowers blood viscosity, inhibits platelet aggre- gation and causes vascular reactivity. Conversely, it was also known that patients with Raynaud syndrome have increased blood viscosity and platelet aggregation and suffer from impaired vascular reactivity [7]. One of the major problem of LBD is the enormous number of connections which effectively rules out checking all possibilities. Therefore, it is very important to concen- trate on the most relevant connections during the early stages of research. Graph-KD supports this process by providing useful information in an easily navigable format. The following functionalities are provided: Closed Discovery applies k-shortest path to find relevant connections between two con- cepts. Passing two concepts of interest to Graph-KB, the tool visualizes the connec- tions between them. In case various relation paths exist, all shortest paths will be dis- played. An example of the shortest paths between the class of pharmacologic substances (nsaids) and a particular disease (kidney disease) is presented in Figure 1. Fig. 1: Closed discovery between nsaids and kidney disease: There is no direct connection, how- ever, the graph shows for instance that kidney disease can be a contraindication for several drugs which belong to NSAID family. Open Discovery explores concepts and relations around a target concept. An example is provided in Figure 2. The graph shows the target concept antipyretics in combination with the target relation may-be-treated-by. The open discovery searches for the relation of interest around the target concept in closest distance. Since antipyretics is not linked to any may-be-treated-by relation in our example, the tool shows nodes in close distance which are connected via this relation. Knowledge Inference is a technique which takes existing facts into account and tries to make assumptions about unknown information. As knowledge graphs tend to be in- complete this can be a useful feature to support the open and closed discovery process. Graph-KD integrates a rule-based inference at this point. The backend of Graph-KD is written in python and has access to Neo4j, an open source NoSQL scalable graph database management system, which stores data in form of nodes and their corresponding typed edges. Using Neo4j’s built-in functionality the open and closed discovery can be executed with a good performance. Graph-KD: Exploring Relational Information for Knowledge Discovery 3 Fig. 2: Open discovery for node antipyretics and relation may-be-treated-by: As target node has no direct link to target relation, the graph shows nodes which connect to may-be-treated-by in closest distance. 3 Demo Use Case For the demo use case, Graph-KD explores information from UMLS, a large biomedi- cal knowledge base containing millions of medical terms and relations between them. UMLS defines medical concepts including their synonyms, and unifies them to a con- cept unique identifier (CUI). All those concepts are linked to at least one semantic type, such as Body Part, Finding or Clinical Drug. Moreover, UMLS defines relations be- tween concepts which include for instance isa, may-treat or contraindication-of. For our demo UMLS 2017AB was preprocessed and diminished. This included the removal of relations containing concepts related with itself, very general relations, in- verted relations and concepts with less meaningful semantic types. Namely, semantic types of the semantic groups GEOG, OBJC, OCCU and ORGA, according to Bodenrei- der and McGray [1], were removed. The resulting data of more than 3 million different CUIs and 9.5 million relations were then imported into Neo4j. The rule inference relies on transitivity rules, between hyponyms in combination with other relations (e.g. if A and B are related and B is a child of C, then we find a transitive relation between A and C). Replication of Existing Discoveries In order to show the benefit of our tool, we explore long range dependencies by replicating existing discoveries as presented in Preiss et al. [5]. As shown in Table 1 Graph-KD is able to replicate former discoveries, such as Raynaud disease and fish oil. In all cases UMLS does not contain any direct connection. However, using Graph-KD it is possible to explore and understand, how information are connected within the complete graph. In most cases the distance (D) is 3. Furthermore, the table shows, that for all two target concepts pairs a large number of different shortest paths can be found (see #). Runtime Table 2 shows the runtime for open and closed discovery. For both scenarios 200,000 randomly generated requests were sent via REST to the backend. The table shows that for approximately 8% (15,499) of all randomly selected CUI pairs a shortest path can be found (max distance 4). For those connections the average (mean) runtime is 0.15 seconds. The maximum response time is 175.02 seconds for closed and 0.51 for open discovery respectively. In addition to that, 75% of the requests are processed within less than a tenth of a second. 4 Roller et al. Discovery D# closed open Raynaud disease – Fish oil 3 127 discovery discovery Somatomedin C – Arginine 3 27 connections 15,499 6,793 Migraine disorders – Magnesium 4 471 mean 0.153920 0.006258 Magnesium deficiency – Neurologic disease 3 108 min 0.001387 0.001212 Alzheimer’s disease – Indomethacin 3 105 75% 0.060400 0.001599 Alzheimer’s disease – Estrogen 3 100 max 175.018601 0.512222 Schizophrenia – Calcium-I. Phospholipase A2 4 22 Table 2: REST runtime test for 200k Table 1: Exploring existing discoveries using Graph-KD random requests 4 Conclusion In this work we presented Graph-KD, a tool to explore large knowledge graphs. Graph- KD provides various functionalities for knowledge discovery and includes knowledge inference methods to gain further into the data. As our example in Table 1 showed, Graph-KD can be easily applied to support literature-based discovery. Moreover, other clinical use cases are possible in which physicians explore information in the knowledge graph in order to detect potential new links between medical concepts. Acknowledgments This project was funded by the European Union’s Horizon 2020 research and inno- vation program under grant agreement No 780495 (BigMedilytics) and by the German Federal Ministry of Economics and Energy through the project MACSS (01MD16011F). References 1. Bodenreider, O., McCray, A.T.: Exploring semantic groups through visual approaches. Journal of Biomedical Informatics 36(6), 414 – 432 (2003), unified Medical Language System 2. Cantor, M.N., Sarkar, I.N., Bodenreider, O., Lussier, Y.A.: Genestrace: phenomic knowledge discovery via structured terminology. In: Biocomputing, pp. 103–114. World Scientific (2005) 3. Gómez-Romero, J., Molina-Solana, M., Oehmichen, A., Guo, Y.: Visualizing large knowledge graphs: A performance analysis. Future Generation Computer Systems 89, 224 – 238 (2018) 4. McInnes, B., Pedersen, T., Pakhomov, S.: UMLS-Interface and UMLS-Similarity: Open Source Software for Measuring Paths and Semantic Similarity. In: Proceedings of the Ameri- can Medical Informatics Association (AMIA) Symposium. San Fransico, CA (2009) 5. Preiss, J., Stevenson, M., Gaizauskas, R.: Exploring relation types for literature-based discov- ery. Journal of the American Medical Informatics Association 22(5), 987–992 (05 2015) 6. Swanson, D.R.: Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspec- tives in biology and medicine 30(1), 7–18 (1986) 7. Weeber, M., Kors, J.A., Mons, B.: Online tools to support literature-based discovery in the life sciences. Briefings in bioinformatics 6(3), 277–286 (2005) 8. Xiang, Y., Lu, K., James, S.L., Borlawsky, T.B., Huang, K., Payne, P.R.: k-Neighborhood decentralization: a comprehensive solution to index the UMLS for large scale knowledge dis- covery. Journal of Biomedical Informatics 45(2), 323–336 (2012)