Machine Learning based Drug Indication Prediction using Linked Open Data

Remzi Çelebi (1), Özgün Erten (2), and Michel Dumontier (3)

(1) Ege University Computer Engineering Department, Izmir, Turkey, remzi.celebi@ege.edu.tr
(2) Ege University Faculty of Medicine, Izmir, Turkey, ozgun.erten@med.ege.edu.tr
(3) Institute of Data Science, Maastricht University, Maastricht, Netherlands, michel.dumontier@maastrichtuniversity.nl

Abstract. In this study, we obtained drug and disease features by querying linked open data, trained classifiers on them to predict new drug indications, and evaluated the predictive performance of the classifiers under different validation schemes. We collected the drug and disease data from Bio2RDF, an open source project that uses Semantic Web technologies to link data from multiple sources. A binary feature matrix was generated from drug targets, chemical substructures and side effects, and from disease ontology terms. We assembled a broader dataset containing 816 drugs and 1393 diseases with their features, together with a gold standard that we generated by combining multiple drug indication data sources. We also applied our method to a different dataset, compiled by other researchers, which confirmed the predictive value of our method independent of the primary data. A crucial flaw in the typical evaluation scheme for drug indication prediction, one that yields unrealistic predictions, is the failure to consider the paired nature of the inputs. We partitioned the data into distinct training and test sets in which not only the pairs but also the drugs or diseases did not overlap. We tested several classifiers under different cross-validation schemes and compared our approach with existing methods. We observed that our model had better predictive performance than the existing models in disjoint cross-validation settings.
Keywords: linked open data, SPARQL, drug repositioning, machine learning, drug indication prediction

1 Introduction

Despite genomic and technological advances, drug discovery and development continues to be a time-consuming and costly process. The number of newly approved drugs has remained far below expectations, notwithstanding substantial investments in the pharmaceutical and health sciences. One attractive option is therefore to reduce the time and cost of drug development by expanding the scope of use of already approved drugs. Given that these drugs have passed stringent approval by the US Food and Drug Administration, there is minimal risk associated with their safety and tolerability. Drug repositioning can dramatically reduce development times and costs from the discovery to the clinical approval stages. Between 20 and 30 scientific papers on drug repositioning are published each month [1]. The importance of repositioned drugs on the market is further highlighted by the fact that they account for 30% of new indications per year.

Previous efforts to predict novel drug indications at large scale have focused on the mapping of gene expression profiles [10, 8] and on the recommendation of similar drugs or diseases based on known drug-disease relationships [4]. Machine learning has a significant advantage over these methods in that it offers a way to optimally combine different drug and disease characteristics into a predictive model. It may also reveal important features that help to identify promising drug indications. Machine learning based drug indication prediction studies have used various similarity measures based on chemical structure, side effects and protein targets. One such approach is the PREDICT method [5], in which five drug-drug similarity measures and two disease-disease similarity measures were used to train a logistic regression classifier to predict potential drug-disease associations.
Zhang and colleagues [23] proposed a k-nearest neighbor approach, Similarity-based LArge-margin learning of Multiple Sources (SLAMS), to predict novel drug indications by calculating a combined similarity score from drug data obtained from different sources. Guney [6] developed an open-source software tool that allows researchers to repeat this work and made it public.

Since machine learning generally treats drug indication prediction as a binary classification problem, it is necessary to specify the known drug indications (positive set) and the drug-disease pairs with no indication (negative set). Although the indications in the positive set are usually previously known, the results of clinical trials in which drugs have failed are often not reported. A recent effort, repoDB [1], aimed to provide a gold standard database that also contains failed drug indications by retrieving clinical trial records from the AACT database (https://www.ctti-clinicaltrials.org/aact-database). However, the number of reported failed indications is far smaller than the number of true indications.

In this study, drug and disease features were obtained by querying open data to train our classifier for predicting new drug indications, and the predictive performance of the classifier under different validation schemes was evaluated. We compared our method with previous computational drug indication prediction approaches. We observed that our method had better predictive performance than PREDICT and SLAMS in disjoint cross-validation settings. Predictions were evaluated against gold standard data generated by combining multiple drug indication data sources. Finally, we make our work open and freely available so that others can use or extend this methodology (https://github.com/rcelebi/drugindication ml).

2 Method

We developed a computational pipeline to reproduce the data and the results of our methodology.
The pipeline consists of the following steps:

1. Query and download open drug and disease data sets
2. Extract features from the data sets
3. Select negative samples and balance the proportion of positive and negative samples that will be introduced into the classifier
4. Apply cross-validation
5. Build classifiers

2.1 Data Compiled from Linked Open Data

Machine learning models to predict drug indications were trained using drug and disease features extracted from open data. Most studies use features already curated for drug repurposing, whereas this study generated its features by querying repositories of linked data. Linked Data refers to data sources that use Semantic Web technologies to make structured content available on the web. By following the principles of Linked Data, these resources become more FAIR: Findable, Accessible, Interoperable, and Reusable [21]. One key resource for the biomedical sciences on the Semantic Web is Bio2RDF [2], an open source project that uses Semantic Web technologies to construct and make available a network of linked data from several major biological databases, including DrugBank, KEGG and SIDER. We used Bio2RDF to obtain raw data, which were subsequently processed to generate the features for our learning model (Figure 1).

Fig. 1. Visualization of the semantic graph for Clozapine in the subset of Bio2RDF. Bio2RDF normalizes and integrates drug data from different data sources in a semantic space.

We wrote and executed SPARQL queries to obtain 816 drugs and their targets from the DrugBank and KEGG datasets, the chemical structure information of these drugs from the DrugBank dataset, the side effect information from SIDER, and MedDRA disease concepts from BioPortal (Noy et al. 2009). If the data are updated to a new version, the queries can be re-executed to obtain the updated data.
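The SPARQL querying step described above can be sketched as follows. This is a minimal illustration, not the queries used in the study: the predicate IRI, the helper names `build_target_query` and `parse_bindings`, and the mocked response are assumptions made for the example. Only the response layout is standard, since it follows the SPARQL 1.1 JSON results format that endpoints commonly return.

```python
from collections import defaultdict

# Assumed predicate IRI for illustration only; the actual Bio2RDF vocabulary
# term should be checked against the endpoint before use.
TARGET_PREDICATE = "http://bio2rdf.org/drugbank_vocabulary:target"


def build_target_query(predicate=TARGET_PREDICATE, limit=None):
    """Return a SPARQL SELECT query listing (drug, target) pairs."""
    query = (
        "SELECT DISTINCT ?drug ?target\n"
        "WHERE {\n"
        f"  ?drug <{predicate}> ?target .\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


def parse_bindings(result):
    """Collect drug -> set of targets from a SPARQL JSON results document."""
    targets = defaultdict(set)
    for row in result["results"]["bindings"]:
        targets[row["drug"]["value"]].add(row["target"]["value"])
    return dict(targets)


# Mocked endpoint response with made-up identifiers (not real data).
mock_result = {
    "head": {"vars": ["drug", "target"]},
    "results": {"bindings": [
        {"drug": {"type": "uri", "value": "drug:A"},
         "target": {"type": "uri", "value": "target:1"}},
        {"drug": {"type": "uri", "value": "drug:A"},
         "target": {"type": "uri", "value": "target:2"}},
        {"drug": {"type": "uri", "value": "drug:B"},
         "target": {"type": "uri", "value": "target:1"}},
    ]},
}

drug_targets = parse_bindings(mock_result)
print(build_target_query(limit=10))
print(drug_targets)
```

Keeping the query text in one function means that re-running the pipeline against an updated release only requires sending the same string to the endpoint again.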
We normalized the data using DrugBank identifiers for drugs and NCBI Gene identifiers for drug targets, while diseases and side effects were mapped to UMLS identifiers so as to integrate the various terminologies.

We obtained drug-disease associations from the DrugCentral and National Drug File Reference Terminology (NDF-RT) repositories. DrugCentral contained a total of 6677 drug-disease relationships covering 1519 drugs and 1229 diseases. NDF-RT contained 2998 drug indications spanning 782 drugs and 737 diseases that have direct mappings to DrugBank and UMLS concepts, respectively. After merging the drug-disease associations, the unified gold standard had 8951 drug indications, 1594 drugs and 1611 diseases (see Table 1). Only 4715 drug-disease associations were used in the experiments, because features could be generated for only 788 drugs and 1103 diseases in the unified gold standard.

Table 1. Statistics about NDF-RT, DrugCentral and the unification of the two gold standards.

                          DrugCentral  NDF-RT  Common  Unified
Drug                             1519     782     707     1594
Disease                          1229     737     355     1611
Drug-Disease Association         6677    2998     724     8951

2.2 Extracting Features

Chemical structure. The molecular structure of a drug reflects its binding activity. Chemical fingerprints are the most commonly used structural profiling marker for drugs [13]. Fingerprints are bit vectors that indicate the presence (1) or absence (0) of certain chemical features (e.g. a C=N group or a six-membered ring). We used the OpenBabel 2.3 library to take a chemical structure in SMILES notation as input and generate Molecular ACCess System (MACCS) binary structural feature lists of length 166.

Drug targets. The set of targets for a drug can shed light on the biological processes it affects. We represent the set of drug targets obtained from DrugBank and KEGG as a bit vector in which 1 indicates that a protein is a target of the drug and 0 indicates that it is not. This results in a sparse matrix, since the drugs have a median of one putative target each.
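The bit-vector encoding of drug targets described above can be sketched in plain Python as follows. The function name `binary_feature_matrix` and the toy drug and gene identifiers are assumptions for illustration; the study's own implementation may differ.

```python
def binary_feature_matrix(item_features):
    """Encode {item: set_of_features} as rows of 0/1 values.

    Returns (items, columns, matrix), where matrix[i][j] is 1 when
    items[i] has feature columns[j] and 0 otherwise.
    """
    items = sorted(item_features)
    # One column per distinct feature seen across all items.
    columns = sorted(set().union(*item_features.values()))
    index = {feature: j for j, feature in enumerate(columns)}
    matrix = []
    for item in items:
        row = [0] * len(columns)
        for feature in item_features[item]:
            row[index[feature]] = 1
        matrix.append(row)
    return items, columns, matrix


# Toy input with made-up identifiers (not data from the study).
toy_targets = {
    "drug_a": {"gene_1"},
    "drug_b": {"gene_2", "gene_3"},
    "drug_c": set(),
}

items, columns, matrix = binary_feature_matrix(toy_targets)
for item, row in zip(items, matrix):
    print(item, row)
```

With a median of one target per drug, most entries are 0, so a sparse representation would usually be preferable at full scale; dense lists are used here only to keep the example short.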
Drug side effects. Side effects elicited by drugs are suggestive of their physiological role. Previous studies have used side effects to estimate drug similarity, despite the potential noise in labeling [3, 22]. We used SIDER [9] as a source of drug side-effect information. SIDER was automatically constructed by text mining of drug product labels and is known to contain false positives.

Disease description. A drug can be indicated for more diseases than its original indications. To capture such cases, it is necessary to produce profiles that describe the level of similarity between diseases. To produce a disease profile, we used the top-level concepts under which the disease falls in an ontology. We obtained the NDF-RT and MedDRA ontologies from BioPortal to describe a disease by its top-level concepts. If a disease is present in an ontology, each top-level concept is encoded in the feature vector as 1 (the disease falls under it) or 0 (it does not).

2.3 Selecting and Balancing Negative Samples

Negative examples were selected at random from among the unknown drug-disease associations, restricted to diseases for which at least one drug is indicated. The size of the negative set is a chosen proportion of the number of pairs in the positive set. The user can specify the proportion of positive and negative samples within each fold.

2.4 Evaluation

Existing studies generally assume that the drugs in the test set also appear in the training set. However, researchers are more interested in finding indications for a drug whose indications are unknown, so an evaluation set up in this way can give misleading information about how well indications are predicted for new drugs. Guney [6] examined the situation in which the drugs in the test set are disjoint from the drugs in the training set.
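The negative sampling of Section 2.3 and the drug-disjoint split described above can be sketched together as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the default ratio, the restriction of candidate drugs to those in the positive set, and the toy edges are all chosen for illustration.

```python
import random


def sample_negatives(positives, ratio=1, seed=0):
    """Draw unknown drug-disease pairs to serve as negatives.

    Candidates pair every drug in the positive set with every disease that
    has at least one indicated drug, excluding the known indications.
    """
    positives = set(positives)
    drugs = sorted({drug for drug, _ in positives})
    diseases = sorted({disease for _, disease in positives})
    candidates = [(drug, disease) for drug in drugs for disease in diseases
                  if (drug, disease) not in positives]
    n_wanted = min(len(candidates), int(ratio * len(positives)))
    return random.Random(seed).sample(candidates, n_wanted)


def drug_disjoint_folds(pairs, n_folds=2, seed=0):
    """Split pairs into folds so that each drug occurs in exactly one fold."""
    drugs = sorted({drug for drug, _ in pairs})
    random.Random(seed).shuffle(drugs)
    fold_of = {drug: i % n_folds for i, drug in enumerate(drugs)}
    folds = [[] for _ in range(n_folds)]
    for drug, disease in pairs:
        folds[fold_of[drug]].append((drug, disease))
    return folds


# Toy indications using compound/phenotype names (edges are made up).
positives = [("c1", "p1"), ("c2", "p1"), ("c3", "p2"), ("c4", "p2")]
negatives = sample_negatives(positives, ratio=1)
folds = drug_disjoint_folds(positives + negatives, n_folds=2)
for i, fold in enumerate(folds):
    print(i, sorted(fold))
```

Grouping on the second element of each pair instead of the first would give the corresponding disease-disjoint split.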
We have expanded Guney's drug-wise cross-validation approach to include disease-wise cross-validation as well (see Figure 2). This allowed us to observe how prediction performance changes when the samples in the test set share no drugs, or no diseases, with the training set. We used these different cross-validation schemes to see how reliable our estimates are for a drug or disease that is not in the training set.

2.5 Building Classifiers

We used the Python scikit-learn machine learning package to build the classifiers. Classifiers were constructed with logistic regression (LR), k-nearest neighbors (KNN), random forest (RF), and gradient boosting classifier (GBC). The parameters were chosen as follows: L2 penalty and C = 1.0 for LR; n_neighbors = 5 for KNN; n_estimators = 1000 and max_depth = 5 for RF and GBC. We implemented approaches for data balancing, cross-validation and classifier building.

Fig. 2. Graphical representation of the training-test split of a toy data set. In this example c1, c2, c3, and c4 correspond to four compounds, and p1 and p2 correspond to two phenotypes. A known drug indication between a compound and a phenotype is represented as an edge of the graph. Since two-fold cross-validation was used in this example, the data were separated into two groups in terms of drug or disease. In case a) drugs and in case b) diseases, together with their related associations, are grouped into training and test data.

3 Results

We first compared our approach with the SLAMS method using the NDF-RT gold standard and the already curated data available online through Guney's tool. By using the same data as the SLAMS method, we wanted to show the predictive value of our method independent of the data compiled. The NDF-RT version that they used contained a total of 3250 drug-disease relationships between 799 drugs and 719 diseases.
We observed the best AUC of 87.10% for the NDF-RT gold standard using the gradient boosting classifier with pair-wise cross-validation (see Table 2). AUC fell to 82.77% in drug-wise cross-validation (in which no drug appears in more than one fold). Here, the number of negative samples chosen was twice the size of the positive set. In comparison, SLAMS yielded a best AUC of 84.65% with logistic regression in pair-wise cross-validation and 68.43% in drug-wise cross-validation. Our method thus improves prediction performance substantially in the drug-wise setting (68.43% to 82.77%) and more modestly in the pair-wise setting (84.65% to 87.10%).

Table 2. Areas under ROC curves (AUC, %) under drug-wise and pair-wise cross-validation, averaged over ten runs of ten-fold cross-validation, for the NDF-RT gold standard.

Model  Drug disjoint  Our Method AUC  SLAMS AUC
LR     no             73.03 ± 2.00    84.65 ± 0.19
LR     yes            69.46 ± 3.79    68.43 ± 0.87
RF     no             82.88 ± 1.69    82.78 ± 0.40
RF     yes            76.79 ± 3.78    65.27 ± 0.82
KNN    no             70.81 ± 2.26    81.83 ± 0.82
KNN    yes            70.52 ± 4.22    65.44 ± 0.75
GB     no             87.10 ± 1.53    84.22 ± 0.36
GB     yes            82.77 ± 3.37    67.82 ± 0.74

We next examined the prediction performance of our method with the unified indication gold standard and the data compiled from linked open data. Figure 3 shows the AUC for different classifiers under various validation schemes, averaged over ten runs of ten-fold cross-validation. We observed that the Gradient Boosting Classifier (GBC) outperformed the other ML methods, with an AUC of 0.88. Under both drug-wise and disease-wise cross-validation schemes, GBC was better than the other ML algorithms and did not fall below an AUC of 0.83. Under the drug-wise disjoint cross-validation scheme, SLAMS obtains an AUC of 0.66 at best, with logistic regression. PREDICT, with the drug and disease similarities computed from the same data, obtains an average AUC of 0.72 with logistic regression under the same scheme.
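The AUC values reported above are averages over repeated cross-validation runs. To make the metric concrete, the sketch below computes AUC as the fraction of positive-negative pairs in which the positive example receives the higher score (ties count one half) and summarises several runs as a mean and a spread. The function names and toy scores are assumptions for illustration, and treating the ± figures as sample standard deviations is also an assumption, since the text does not say how they were computed.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarise(aucs):
    """Return (mean, sample standard deviation) of AUC values in percent."""
    return 100 * mean(aucs), 100 * stdev(aucs)


# Toy labels and classifier scores (made up, not results from the study).
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc(labels, scores)
print(round(toy_auc, 4))

# Toy per-run AUC values, summarised the way a table row would be.
run_mean, run_sd = summarise([0.80, 0.90])
print(f"{run_mean:.2f} ± {run_sd:.2f}")
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based computation would be the usual choice for data sets of the size described here.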
3.1 Analysis of Novel Predictions

To evaluate the predictive power of our method, we investigated the predictions made by our tool for the drug Reboxetine. Reboxetine is an antidepressant in the selective noradrenaline reuptake inhibitor (SNARI) group used in the treatment of depression; it has high affinity for the noradrenaline transporter and selectively inhibits presynaptic noradrenaline reuptake. Reboxetine has only one indication (Major Depressive Disorder) specified in our gold standard. In the current literature, Reboxetine is also suggested as an effective and safe option for the treatment of depression, sleep disorders [11, 17], eating disorders [7, 18], attention deficit hyperactivity disorder (ADHD) [19, 16], panic attack [20] and depression in parkinsonian patients [12]. The estimates we made for these potential indications are given in Table 3.

The probabilities in Table 3 were produced by the logistic classifier, and the average probability over the 17 indications is 0.65. For the first 15 diseases the probability is greater than 0.5, which we interpret as a likely indication. Among all diseases, our model predicts that 200 may be associated with Reboxetine (P > 0.5). Besides these reasonable estimates, the model also produces predictions such as Hypertensive disease (P = 0.937) and Allergic rhinitis (P = 0.986), which need to be supported by evidence from the literature.

Fig. 3. Areas under ROC curves (AUC) under various validation schemes, averaged over ten runs of ten-fold cross-validation, for the unified gold standard.

4 Conclusion

Researchers have exploited publicly accessible datasets to validate their hypotheses for the prediction of drug indications. However, the datasets are diverse and are subject to change over time, which may result in different conclusions for the same hypotheses.
We used Semantic Web technologies, specifically Linked Data, to represent, link and access data related to drugs and diseases provided by the Bio2RDF project. We used SPARQL queries to obtain drug and disease features to train classifiers. If the data are updated to a new version, the queries can be re-executed to obtain the updated data. We assembled a wider collection of data containing 816 drugs and 1393 diseases with their features. Predictions were evaluated against gold standard data generated by combining multiple drug indication data sources. We also applied our method to a different dataset, compiled by Zhang et al. [23], which showed the predictive value of our method independent of the data compiled.

A crucial flaw in the typical evaluation scheme for drug indication prediction, one that yields unrealistic predictions, is the failure to consider the paired nature of the inputs [15]. We partitioned the data into distinct training and test sets in which not only the pairs but also the drugs or diseases do not overlap, as suggested in [14] for drug-target interaction prediction. We tested several classifiers under different cross-validation schemes and compared our approach with existing methods, namely PREDICT and SLAMS. We observed that our method had better predictive performance than PREDICT and SLAMS in disjoint cross-validation settings.

Table 3. Potential indications for Reboxetine and prediction scores by our model.

Ranking  Disease                                    Probability
1        Narcolepsy                                 0.82
2        Depressive disorder                        0.8
3        Parkinson Disease                          0.78
4        Schizophrenia                              0.77
5        Obsessive-Compulsive Disorder              0.71
6        Anxiety Disorders                          0.71
7        Generalized Anxiety Disorder               0.71
8        Panic Disorder                             0.69
9        Obesity                                    0.69
10       Cerebrovascular accident                   0.67
11       Anorexia nervosa                           0.67
12       Bulimia Nervosa                            0.67
13       Attention deficit hyperactivity disorder   0.63
14       Eating Disorders                           0.63
15       Post-Traumatic Stress Disorder             0.58
16       Fibromyalgia                               0.46
17       Binge eating disorder                      0.18

Acknowledgement. The first named author (R.C.)
is grateful to TUBITAK for providing financial support under the 2214-A programme.

References

1. Brown, A.S., Patel, C.J.: A standard database for drug repositioning. Scientific Data 4, 170029 (2017)
2. Callahan, A., Cruz-Toledo, J., Ansell, P., Dumontier, M.: Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Extended Semantic Web Conference. pp. 200–212. Springer (2013)
3. Campillos, M., Kuhn, M., Gavin, A.C., Jensen, L.J., Bork, P.: Drug target identification using side-effect similarity. Science 321(5886), 263–266 (2008)
4. Chiang, A.P., Butte, A.J.: Systematic evaluation of drug–disease relationships to identify leads for novel drug uses. Clinical Pharmacology & Therapeutics 86(5), 507–510 (2009)
5. Gottlieb, A., Stein, G.Y., Ruppin, E., Sharan, R.: PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology 7(1), 496 (2011)
6. Guney, E.: Reproducible drug repurposing: When similarity does not suffice. In: Pacific Symposium on Biocomputing 2017. pp. 132–143 (2017)
7. Hay, P.J., Claudino, A.M.: Bulimia nervosa: online interventions. BMJ clinical evidence 2015 (2015)
8. Hu, G., Agarwal, P.: Human disease-drug network based on genomic expression profiles. PloS one 4(8), e6536 (2009)
9. Kuhn, M., Letunic, I., Jensen, L.J., Bork, P.: The SIDER database of drugs and side effects. Nucleic acids research 44(D1), D1075–D1079 (2015)
10. Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P., Subramanian, A., Ross, K.N., et al.: The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795), 1929–1935 (2006)
11. Larrosa, O., de la Llave, Y., Barrio, S., Granizo, J.J., Garcia-Borreguero, D.: Stimulant and anticataplectic effects of reboxetine in patients with narcolepsy: a pilot study. Sleep 24(3), 282–285 (2001)
12. Lemke, M.R.: Effect of reboxetine on depression in Parkinson's disease patients. The Journal of clinical psychiatry 63(4), 300–304 (2002)
13. Melville, J.L., Hirst, J.D.: TMACC: Interpretable Correlation Descriptors for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 47(2), 626–634 (Mar 2007), http://dx.doi.org/10.1021/ci6004178
14. Pahikkala, T., Airola, A., Pietilä, S., Shakyawar, S., Szwajda, A., Tang, J., Aittokallio, T.: Toward more realistic drug–target interaction predictions. Briefings in bioinformatics 16(2), 325–337 (2014)
15. Park, Y., Marcotte, E.M.: Flaws in evaluation schemes for pair-input computational predictions. Nature methods 9(12), 1134–1136 (2012)
16. Ratner, S., Laor, N., Bronstein, Y., Weizman, A., Toren, P.: Six-week open-label reboxetine treatment in children and adolescents with attention-deficit/hyperactivity disorder. Journal of the American Academy of Child & Adolescent Psychiatry 44(5), 428–433 (2005)
17. Schmidt, C., Leibiger, J., Fendt, M.: The norepinephrine reuptake inhibitor reboxetine is more potent in treating murine narcoleptic episodes than the serotonin reuptake inhibitor escitalopram. Behavioural brain research 308, 205–210 (2016)
18. Silveira, R.O., Zanatto, V., Appolinario, J., Kapczinski, F.: An open trial of reboxetine in obese patients with binge eating disorder. Eating and Weight Disorders-Studies on Anorexia, Bulimia and Obesity 10(4), e93–e96 (2005)
19. Tehrani-Doost, M., Moallemi, S., Shahrivar, Z.: An open-label trial of reboxetine in children and adolescents with attention-deficit/hyperactivity disorder. Journal of child and adolescent psychopharmacology 18(2), 179–184 (2008)
20. Versiani, M., Cassano, G., Perugi, G., Benedetti, A., Mastalli, L., Nardi, A., Savino, M.: Reboxetine, a selective norepinephrine reuptake inhibitor, is an effective and well-tolerated treatment for panic disorder. The Journal of clinical psychiatry (2002)
21. Wilkinson, M., Dumontier, M., Aalbersberg, I., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J., da Silva Santos, L., Bourne, P., Bouwman, J., Brookes, A., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C., Finkers, R., Gonzalez-Beltran, A., Gray, A., Groth, P., Goble, C., Grethe, J., Heringa, J., 't Hoen, P., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S., Martone, M., Mons, A., Packer, A., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M., Thompson, M., Van Der Lei, J., Van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., Mons, B.: The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3 (2016)
22. Yang, L., Agarwal, P.: Systematic drug repositioning based on clinical side-effects. PloS one 6(12), e28025 (2011)
23. Zhang, P., Agarwal, P., Obradovic, Z.: Computational drug repositioning by ranking and integrating multiple data sources. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 579–594. Springer (2013)