Reflections on: Finding Melanoma Drugs Through a Probabilistic Knowledge Graph James P. McCusker1 , Michel Dumontier4 , Rui Yan1 , Sylvia He1 , Jonathan S. Dordick2,3 , and Deborah L. McGuinness1 , 3 1 Department of Computer Science, 2 Department of Chemical & Biological Engineering, 3 Center for Biotechnology & Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, NY, US 4 Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, US Abstract. We build a nanopublication-based knowledge graph of pro- tein/protein, drug/protein, and protein/disease interactions to create a resource for exploring potential therapies for diseases. This is accom- plished using Semantic Web standard tools like Blazegraph, SADI web services, and JSON-LD for integration with a Javascript-based web client. Metastatic cutaneous melanoma is an aggressive skin cancer with some progression-slowing treatments but no known cure. The omics data ex- plosion has created many possible drug candidates, however filtering cri- teria remain challenging, and systems biology approaches have become fragmented with many disconnected databases. Using drug, protein, and disease interactions, we built an evidence-weighted knowledge graph of integrated interactions. Our knowledge graph-based system, ReDrugS, can be used via an API or web interface, and has generated 25 high quality melanoma drug candidates. We show that probabilistic analysis of systems biology graphs increases drug candidate quality compared to non-probabilistic methods. Four of the 25 candidates are novel therapies, three of which have been tested with other cancers. All other candidates have current or completed clinical trials, or have been studied in in vivo or in vitro. This approach can be used to identify candidate therapies for use in research or personalized medicine. Keywords: melanoma, drug repositioning, knowledge graphs, uncertainty rea- soning 1 Introduction Metastatic cutaneous melanoma is an aggressive cancer of the skin with low prevalence but very high mortality rate, with an estimated 5 year survival rate of 6 percent [1] There are currently no known therapies that can consis- tently cure metastatic melanoma. Vemurafenib is effective against BRAF mu- tant melanomas [2] but resistant cells often result in recurrence of metastases Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 Authors Suppressed Due to Excessive Length [8] Melanoma itself may be best approached based on the individual genetics of the tumor, as it has been shown to involve mutations in many different genes to produce the same disease [7]. Because of this, an individualized approach may be necessary to find effective treatments. A knowledge graph is a compilation of facts and figures that can be used to provide contextual meaning to searches. Google is using knowledge graphs to improve its search and to analyze the information graph of the web; Facebook is using them to analyze the social graph. We built our knowledge graph with the goal of unifying large parts of biomedical domain knowledge for both mining and interactive exploration related to drugs, diseases, and proteins. Our knowledge graph is enhanced by the provenance of each fragment of knowledge captured, which is used to compute the confidence probabilities for each of those fragments. Further, we use open standards from the World Wide Web Consortium (W3C), including the Resource Description Framework (RDF) [6], Web Ontology Lan- guage (OWL) [12], and SPARQL [4]. The representation of the knowledge in our knowledge graph is aligned with best practice vocabularies and ontologies from the W3C and the biomedical community, including the PROV Ontology [9], the HUPO Proteomics Standards Initiative Molecular Interactions (PSI-MI) Ontol- ogy [5], and the Semanticscience Integrated Ontology (SIO) [3]. Use of these standards, vocabularies, and ontologies make it simple for ReDrugS to integrate with other similar efforts in the future with minimal effort. We built a novel computational drug repositioning platform, that we refer to as ReDrugS, that applies probabilistic filtering over individually-supported assertions drawn from multiple databases pertaining to systems biology, phar- macology, disease association, and gene expression data. We use our platform to identify novel and known drugs for melanoma. 2 Results We used ReDrugS to examine the drug-target-disease network and identify known, novel, and well supported melanoma drugs. The ReDrugS knowledge base contained 6,180 drugs, 3,820 diseases, 69,279 proteins, and 899,198 interac- tions. We examined drug and gene connections that were 3 or less interaction steps from melanoma, and additionally filtered interactions with a joint probability greater or equal to 0.93. We identified 25 drugs in the resulting drug-gene-disease network surrounding melanoma as illustrated in Figure 1 . We then validated the set of 25 drugs by determining their position in the drug discovery pipeline for melanoma. Nearly all drugs uncovered by ReDrugS were previously been identified as potential melanoma therapies either in clinical trials or in vivo or in vitro. Of the 25 drugs, 12 have been in Phase I, II, or III clinical trials, 5 have been studied in vitro, 4 in vivo, 1 was investigated as a case study, and 3 are novel. To further evaluate our system, we examined the impact of decreasing the joint probability or increasing the number of interaction steps. Figures 2 A and Title Suppressed Due to Excessive Length 3 Fig. 1. The interaction graph of predicted melanoma drugs with a probability of 0.93 or higher and have three or fewer intervening interactions between drug and disease. The “Explore” tab contains the controls to expand the network in various ways, including the filtering parameters. Node and edge detail tabs provide additional information about the selected node or edge, including the probabilities of the edges selected. Users can control the layout algorithm and related options using the “Options” tab. B show precision, recall, and f-measure curves while varying each parameter. Using these information retrieval performance curves we found that using a joint probability of 0.93 or greater with 3 or less interaction steps maximizes the precision and recall as shown in Figure 2. By performing a literature search on hypothesis candidates with a joint prob- ability of 0.5 or higher and 6 or fewer interaction steps, we were able to generate precision, recall, and f-measure curves for both cutoffs to find our cutoff of 0.93 with 3 or fewer interaction steps. The precision, recall, and f-measure curves are shown for varying joint probability thresholds in Figure 2 A and for varying interaction step counts in Figure 2 B. 3 Discussion We designed ReDrugS to quickly and automatically integrate and filter a het- erogeneous biomedical knowledge graph to generate high-confidence drug reposi- tioning candidates. Our results indicate that ReDrugs generates clinically plau- 4 Authors Suppressed Due to Excessive Length (A) Information Retrieval by Probability Threshold (B) Information Retrieval by Network Expansion Step precision recall f­measure precision recall f­measure 0.75 0.75 0.5 0.5 0.25 0.25 0.9 0.8 0.7 0.6 3 4 5 Fig. 2. Precision, recall, and f-measure by (A) varying thresholds for joint probability and (B) varying number of interaction steps. Precision is the percentage of returned candidates that have been validated experimentally or have been in a clinical trial (a “hit”) versus all candidates returned. Recall is the percentage of all known validated “hits”. F-measure is the geometric mean of precision and recall that provides a balanced evaluation of the quality and completeness of the results. sible drug candidates, in which half are in various stages of clinical trials, while others are novel or are being investigated in pre-clinical studies. By helping to consolidate the three main datatypes - drug targets, protein interactions, and disease genes, ReDrugs can amplify the ability of researchers to filter the vast amount of information into those that are relevant for drug discovery. 3.1 Architecture ReDrugS uses a fairly straightforward web architecture, as shown in Figure 3. It uses the Blazegraph RDF database backend. The database layer is inter- changeable except that the full text search service needs to use Blazegraph-only properties to perform text searches as text indexing is not yet standardized in the SPARQL query language. All other aspects are standardized and should work with other RDF databases without modification. ReDrugs currently uses the Python-based TurboGears web application framework hosted using the Web Services Gateway Interface (WSGI) standard via an Apache HTTP server. Tur- boGears in turn hosts the SADI web services that drive the application and access the database. It also serves up the static HTML and supporting files. The user interface is implemented with AngularJS and Cytoscape.js, which submits queries to the SADI web services using JSON-LD and aggregates results into the networked view. The software relies exclusively on standardized proto- cols (HTTP, SADI, SPARQL, RDF, and others) to make it simple to replace technologies as needed. The data itself is processed using conversion scripts as shown in Figure 4. Title Suppressed Due to Excessive Length 5 Javascript Web Client Cytoscape.js JSON-LD /api/search /api/upstream /api/downstream Python + Apache Web Server SPARQL RDF Store Fig. 3. The ReDrugS software architecture. Using web standards and a three layer architecture (RDF store, web server, and rich web client), we were able to build a complete knowledge graph analysis platform. 4 Materials and Methods This research project did not involve human subjects. The ReDrugS platform consists of a graphical web application, an application programming interface (API), and a knowledge base. The graphical web application enables users to initiate a search using drug, gene, and disease names and synonyms. Users can then interact with the application to expand the network at an arbitrary number of interactions away from the entity of interest, and to filter the network based on a joint probability between the source and target entities. Drug-protein, protein- protein, and gene-disease interactions were obtained from several datasets and integrated into ontology-annotated and provenance and evidence bearing rep- resentations called nanopublications. The web application obtains information from the knowledge base using semantic web services. Finally, we evaluated our approach by examining the mechanistic plausibility of the drug in having melanoma-specific disease modifying ability. We evaluated a large number of pos- sible drug/disease associations with varying joint probabilities and interaction steps to determine the thresholds with the highest F-Measure, resulting in our thresholds of three or less interactions and a joint probability of 0.93 or higher. 4.1 Semantic Web Services We developed four Semantic Automated Discovery and Integration (SADI) web services [13] in Python to support easy access to the nanopubications (see Table 1) in ReDrugS. The four services are enumerated in Table 1. 6 Authors Suppressed Due to Excessive Length Ontological Resources Protein/Protein Interaction Ontology, Semanticscience Integrated Ontology, Experimental Gene Ontology Method vocabularies, relationships Assessment evidence to Confidence scores of probability iRefIndex converted to nanopubs experimental methods. ReDrugS RDF Store queries ReDrugS API Interaction network search and expansion queries graph queries graph Analytical Tools ReDrugS Cytoscape, R, Python, etc. Cytoscape.js App Fig. 4. The ReDrugS data flow. Data is selected from external databases and converted using scripts into nanopublication graphs, which are loaded into the ReDrugS data store. This is combined with experimental method assessments, expressed in OWL, and public ontologies into the RDF store. The web service layer queries the store and produces aggregate analyses of those nanopublications, which is consumed and displayed by the rich web client. The same APIs can be used by other tools for further analysis. The first service is a simple free text lookup, that takes an pml:Query 5 [10] with a prov:value as a query and produces a set of entities whose labels contain the substring. This is used for interactive typeahead completion of search terms so users can look up URIs and entities without needing to know the details. The other three SADI services look up interactions that contain a named entity. Two of them look at the entity to find upstream and downstream con- nections, and the third service assumes that the entity is a biological process and finds all interactions that related to that process. The services return only one interaction for each triple (source, interaction type, target). There are of- ten multiple probabilities per interaction, and more than one interaction per interaction type. This is because the interaction may have been recorded in mul- tiple databases, based on different experimental methods. To provide a single probability score for each interaction of a source and target, the interactions are combined. A single probability is generated per identified interaction by tak- ing the geometric mean of the probabilities for that interaction. However, this 5 PML 3, in development: https://github.com/timrdf/pml. This includes PML 2 con- structs that are not covered in PROV-O. Title Suppressed Due to Excessive Length 7 Service Description URL Input Output Name Resource Look up resources using search pml:Query pml:AnsweredQuery text tearch free text search against their RDFS labels. This service is optimized for ty- peahead user interfaces. Find inter- Find interactions whose process sio:Process sio:Process actions in participants or targets also a biological participate in the input process process. Find up- Find interactions that the upstream sio:MaterialEntity sio:Target stream input entity is a target of participants in and have explicit partic- ipants. Find down- Find interactions that the downstream sio:MaterialEntity sio:Agent stream tar- input entity participates in gets and have explicit targets. Table 1. The API endpoint prefix is http://redrugs.tw.rpi.edu/api/. method is undesirable when combining multiple interaction records of the same type. We instead combine the interaction records using a form of probabilistic voting using composite Z-Scores. This is done to model that multiple experi- ments that produce the same results reinforce each other, and should therefore give a higher overall probability than would be indicated by taking their mean or even by Bayes Theorem. We do this by converting each probability into a Z Score (aka Standard Score) using the Quantile Function (Q()), summing the val- ues, and applying the Cumulative Distribution Function (CDF ()) to compute the corresponding probability: n ! X P (x1...n ) = CDF Q (P (xi )) i=1 These composite Z Scores, which we transform back into probabilities, are frequently used to combine multiple indicators of the same underlying phenom- ena, as in [11]. 4.2 User Interface The user interface was developed using the above SADI web services and uses Cytoscape.js,6 , angular.js,7 and Bootstrap 3.8 An example network is shown 6 http://cytoscape.github.io/cytoscape.js 7 https://angularjs.org 8 http://getbootstrap.com 8 Authors Suppressed Due to Excessive Length in Figure 1 Users can search for biological entities and processes, which can then be autocompleted to specific entities that are in the ReDrugS graph. Users can then add those entities and processes to the displayed graph and retrieve upstream and downstream connections and link out to more details for every entity. Cytoscape.js is used as the main rendering and network visualization tool, and provides node and edge rendering, layout, and network analysis capabilities, and has been integrated into a customized rich web client. In order to evaluate this knowledge graph, we developed a demonstration web interface9 based on the Cytoscape.js10 JavaScript library. The interface lets users enter biological entity names. As the user types, the text is resolved to a list of entities. The user finishes by selecting from the list, and submitting the search. The search returns interactions and nodes associated with the entity selected, which are added to the Cytoscape.js graph. Users are also able to select nodes and populate upstream or downstream connections. Figure 1 is an example output of this process. References 1. Barth, A., Wanek, L., Morton, D.: Prognostic factors in 1,521 melanoma patients with distant metastases. J Am Coll Surg 181, 193–201 (Sep 1995) 2. Chapman, P.B., Hauschild, A., Robert, C., Haanen, J.B., Ascierto, P., Larkin, J., Dummer, R., Garbe, C., Testori, A., Maio, M., Hogg, D., Lorigan, P., Lebbe, C., Jouary, T., Schadendorf, D., Ribas, A., O’Day, S.J., Sosman, J.A., Kirk- wood, J.M., Eggermont, A.M., Dreno, B., Nolop, K., Li, J., Nelson, B., Hou, J., Lee, R.J., Flaherty, K.T., McArthur, G.A.: Improved Survival with Vemu- rafenib in Melanoma with BRAF V600E Mutation. New England Journal of Medicine 364(26), 2507–2516 (jun 2011). https://doi.org/10.1056/nejmoa1103782, http://dx.doi.org/10.1056/NEJMoa1103782 3. Dumontier, M., Baker, C.J., Baran, J., Callahan, A., Chepelev, L., Cruz-Toledo, J., Del Rio, N.R., Duck, G., Furlong, L.I., Keath, N., Klassen, D., McCusker, J.P., Queralt-Rosinach, N., Samwald, M., Villanueva-Rosales, N., Wilkinson, M.D., Hoehndorf, R.: The semanticscience integrated ontology (sio) for biomed- ical research and knowledge discovery. Journal of Biomedical Semantics 5(1), 14 (2014). https://doi.org/10.1186/2041-1480-5-14, http://dx.doi.org/10.1186/ 2041-1480-5-14 4. Harris, S., Seaborne, A., Prudhommeaux, E.: SPARQL 1.1 query language. W3C Recommendation 21 (2013) 5. Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., Salwinski, L., Ceol, A., Moore, S., Orchard, S., Sarkans, U., von Mering, C., Roechert, B., Poux, S., Jung, E., Mersch, H., Kersey, P., Lappe, M., Li, Y., Zeng, R., Rana, D., Nikolski, M., Husi, H., Brun, C., Shanker, K., Grant, S.G.N., Sander, C., Bork, P., Zhu, W., Pandey, A., Brazma, A., Jacq, B., Vidal, M., Sherman, D., Legrain, P., Cesareni, G., Xenarios, I., Eisenberg, D., Steipe, B., Hogue, C., Apweiler, R.: The hupo psi’s molecular interaction formata community standard for the representation of protein interaction data. Nature biotechnology 22(2), 177–183 (2004) 9 http://redrugs.tw.rpi.edu 10 http://cytoscape.github.io/cytoscape.js Title Suppressed Due to Excessive Length 9 6. Klyne, G., Carroll, J.J.: Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation (2005) 7. Krauthammer, M., Kong, Y., Bacchiocchi, A., Evans, P., Pornputtapong, N., Wu, C., McCusker, J., Ma, S., Cheng, E., Straub, R., Serin, M., Bosenberg, M., Ariyan, S., Narayan, D., Sznol, M., Kluger, H., Mane, S., Schlessinger, J., Lifton, R., Hala- ban, R.: Exome sequencing identifies recurrent mutations in NF1 and RASopathy genes in sun-exposed melanomas. Nat Genet 47, 996–1002 (Sep 2015) 8. Le, K., Blomain, E.S., Rodeck, U., Aplin, A.E.: Selective RAF inhibitor im- pairs ERK1/2 phosphorylation and growth in mutant NRAS vemurafenib- resistant melanoma cells. Pigment Cell Melanoma Res 26(4), 509–517 (apr 2013). https://doi.org/10.1111/pcmr.12092, http://dx.doi.org/10.1111/pcmr.12092 9. Lebo, T., Sahoo, S., McGuinness, D.: PROV-O: The PROV Ontology. http:// www.w3.org/TR/prov-o/ (2013) 10. McGuinness, D.L., Ding, L., Silva, P.P.D., Chang, C.: PML 2: A Modular Expla- nation Interlingua. In: Proceedings of the AAAI 2007 Workshop on Explanation- Aware Computing. pp. 22 – 23 (2007), http://citeseerx.ist.psu.edu/viewdoc/ summary?doi=10.1.1.186.8633 11. Moller, J., Cluitmans, P., Rasmussen, L., Houx, P., Rasmussen, H., Canet, J., Rabbitt, P., Jolles, J., Larsen, K., Hanning, C., Langeron, O., Johnson, T., Lauven, P., Kristensen, P., Biedler, A., van Beem, H., Fraidakis, O., Silverstein, J., Beneken, J., JS, G.: Long-term postoperative cognitive dysfunction in the elderly: ISPOCD1 study. The Lancet 351(9106), 857–861 (Mar 1998). https://doi.org/10.1016/s0140- 6736(97)07382-0, http://dx.doi.org/10.1016/S0140-6736(97)07382-0 12. Motik, B., Patel-Schneider, P.F., Cuenca Grau, B.: OWL 2 Web On- tology Language: Direct Semantics (2009), http://www.w3.org/TR/2009/ REC-owl2-direct-semantics 13. Wilkinson, M., Vandervalk, B., McCarthy, L.: SADI Semantic Web Services - ’cause you can’t always GET what you want! In: 2009 IEEE Asia-Pacific Ser- vices Computing Conference (APSCC). Institute of Electrical & Electronics En- gineers (IEEE) (dec 2009). https://doi.org/10.1109/apscc.2009.5394148, http: //dx.doi.org/10.1109/APSCC.2009.5394148