Automated drug repurposing workflow for rare diseases

Carmen A.T. Reep1,∗, Katherine Wolstencroft1, Eleni Mina2,† and Núria Queralt-Rosinach2,†

1 The Leiden Institute of Advanced Computer Science (LIACS), Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
2 Leiden University Medical Centre, Department of Human Genetics, Einthovenweg 20, 2333 ZC Leiden, The Netherlands

Abstract
There are over 7000 known rare diseases. Each one affects fewer than 1 in 2000 individuals, but collectively, they affect approximately 10% of the European and American populations. Developing treatment options for rare diseases is essential for those with such conditions, but as drug development is a time-consuming and costly process, developing new treatments is not often economically viable. The result is that fewer than 6% of rare diseases have approved treatment options. The rare disease research community is adopting new approaches to this problem, where the focus is not on developing novel treatments, but on identifying approved drugs which could be repurposed to treat other conditions. These computational drug repurposing approaches require data and knowledge integration to establish links between diseases, their symptoms, associated genes and drugs. Representing these concepts and relationships as a knowledge graph of machine-readable nodes and edges enables predictions to be made about missing edges that may represent new drug-target interactions. In this study, we developed an automated computational drug repurposing workflow for rare diseases. The workflow integrates data mining and knowledge graph techniques, using the BioKnowledge Reviewer, together with state-of-the-art machine learning for link prediction, using graph embeddings and XGBoost. We demonstrate the utility of the workflow with a use-case in Huntington’s disease, which is a rare neurodegenerative disorder of the central nervous system, caused by an elongated CAG repeat in the huntingtin gene.
To evaluate the predictions made by the workflow, we manually explored the three top-ranked drug predictions for Huntington’s disease. All three drugs are supported by evidence as plausible candidates. A similar analysis of Spinocerebellar ataxia type 1 (SCA1), a related neurodegenerative condition, yielded similarly promising results and showed the reproducibility of the method and workflow. The workflow is available at https://github.com/carmenreep/DrugRepurposing.

Keywords
drug repurposing, workflow, knowledge graph, rare diseases, Huntington’s disease

SWAT4HCLS 2024: The 15th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences, February 26–29, 2024, Leiden, The Netherlands
Email: c.reep@erasmusmc.nl (C. A.T. Reep); k.j.wolstencroft@liacs.leidenuniv.nl (K. Wolstencroft); e.mina@lumc.nl (E. Mina); nqueralt.r@gmail.com (N. Queralt-Rosinach)
ORCID: 0000-0002-9408-7324 (C. A.T. Reep); 0000-0002-1279-5133 (K. Wolstencroft); 0000-0002-8972-9206 (E. Mina); 0000-0003-0169-8159 (N. Queralt-Rosinach)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073

1. Background
Rare diseases are low-prevalence disorders caused by pathogenic mutations or harmful environmental factors that can have chronic, debilitating, or life-threatening effects [1]. Currently, there are over 7000 rare diseases that affect approximately 10% of the European and American populations, yet fewer than 6% have an approved treatment option [2]. This highlights the pressing need for developing therapies targeting rare diseases. However, the development of a new drug can be a time-consuming and costly process, taking up to 15 years and costing as much as US$2.5 billion [2].
Consequently, the development of novel drugs for rare diseases, which affect only a small number of individuals, is not pursued frequently, as it is less likely to provide a return on investment for pharmaceutical companies [1]. A cost-efficient and faster way to provide drugs for rare diseases is via computational drug repurposing. Drug repurposing is the process of identifying a new use for an already approved or investigational drug outside the scope of the original medical indication. For example, a drug could be repurposed for a different disease, based on the knowledge that drugs target particular pathways and disease mechanisms that may be shared by multiple diseases [3]. Computational drug repurposing aims to predict novel drug-disease associations, which can be achieved by predicting drug-target interactions (DTIs) [4]. The computational prediction of new DTIs can provide insights into potential pathological and drug mechanisms, as well as drug repurposing and design, helping researchers to generate testable hypotheses in the lab [5]. Network-based data integration and machine learning-based methods for DTI prediction can mitigate costly and time-consuming experimental verifications and are the current state-of-the-art approaches in computational drug repurposing [6, 7, 8, 9, 10]. The landscape of biomedical information resources is heterogeneous and broad, yet most current methods for predicting DTIs are limited to homogeneous networks or bipartite models, failing to account for the intricate relationships among diverse data sources [6]. To fully exploit the potential of computational drug repurposing, we propose an automated workflow for predicting DTIs that does take complex relationships among diverse data sources into account. For our DTI prediction, we use the biophysical drug repurposing approach, which is based on the hypothesis that structurally similar drug molecules share similar targets.
We extract biological data from multiple online databases using the BioKnowledge reviewer library (https://github.com/SuLab/bioknowledge-reviewer), a tool developed by Queralt-Rosinach et al. [11]. This library integrates heterogeneous knowledge and data into a knowledge graph, which is a machine-readable semantic representation of relational information, where concepts are encoded as nodes and relationships between them as edges. The prediction of DTIs can then be framed as a link prediction problem, where the goal is to identify missing edges in the knowledge graph between drugs and targets that represent potential DTIs. To address this challenge, our proposed automated drug repurposing workflow leverages both network-based analysis and machine learning methods. Network-based methods help to identify potential interactions based on network topology and structural features, while machine learning methods can use more complex data features to make predictions. By combining these approaches, our automated workflow is able to discover new DTIs that can be further used by biologists to generate drug repurposing hypotheses that can be tested in the lab. The drug repurposing workflow generalises to any rare disease, but as a proof-of-concept, we focus on Huntington’s disease (HD). HD is a rare neurodegenerative disorder of the central nervous system characterized by dementia, involuntary movements due to the movement disorder chorea, and behavioural and psychiatric disturbances [12]. There are some symptomatic treatments available, but because their effects are limited, there is a constant need for better, modifying drugs to treat symptoms of the disease [12]. We present the automated data mining workflow as a web application, using Flask, which makes the workflow accessible for researchers with no technical expertise.
The Python code for running this app is accessible as a Docker container, available at https://github.com/carmenreep/DrugRepurposing, which runs on a laptop with minimal specifications (a CPU of at least 2.80 GHz).

2. Methods

2.1. Workflow steps
Figure 1 provides an overview of the proposed drug repurposing workflow. The workflow comprises four main steps: (1) creation of the knowledge graph, (2) embedding of the graph, (3) creation of edge representations, and (4) training of a supervised machine learning model. Besides finding missing edges (potential DTIs) in the knowledge graph, the machine learning model also predicts the interaction types of these missing edges. Good embeddings for this link prediction task were achieved through enriched information on the drugs and target sites.

2.2. Data sources
To obtain both human and animal biological data and metadata, the workflow uses the Monarch Biolink API version 1.1.14 (https://api.monarchinitiative.org/api). The Monarch Initiative, a collaborative, open science project, seeks to semantically integrate genotype-phenotype information from numerous sources and species [13]. To integrate drugs into the knowledge graph, the workflow utilizes the Drug-Gene Interaction Database (DGIdb), a web resource that aggregates information on drug-gene interactions and druggable genes from various sources, including publications, databases, and web-based resources [14]. We obtain drug-gene information from DGIdb using its API (version v2) available at https://dgidb.org/api.

2.3. Knowledge graph construction
The workflow leverages the BioKnowledge reviewer library to extract and integrate data from online sources into a knowledge graph. Starting with a list of seed nodes, a Monarch network is created by including the first layer of neighbours and relations from Monarch for each seed node, along with their ortholog-phenotype nodes.
A seed node can take the form of a disease phenotype MIM number, such as ’143100’ representing Huntington’s disease. The edges are formatted as triples, where each triple includes additional information such as the reference Uniform Resource Identifier (URI), the date when the information was obtained, and more information about the semantics of the relation. Nodes in the graph are identified using different biomedical ontologies in the OBO Foundry (https://doi.org/10.1093/database/baab069) maintained or used by Monarch and contain other attributes such as semantic group, URI, label, name, synonyms, and description. The URI serves as a link to a web page that provides a more detailed description of the ontology term representing the node [11]. To obtain drug-target information, we use DGIdb and take all genes (targets) in the Monarch graph as seeds. First, we need to map the Monarch genes to Entrez Gene identifiers (Entrez ID), which are used as a standard gene identifier system [15]. We accomplish this using the BioThings MyGene.info API, accessed with the Python wrapper biothings_client version v0.2.6 [16]. For each gene, we obtain a list of drugs (ID, name) that interact with the gene, along with the type of interaction and interaction source.

Table 1
The three drug-gene interaction categories, with all interaction types that belong to each category. Source: https://www.dgidb.org/interaction_types.

| category | ID | interaction types |
|---|---|---|
| inhibits | RO:0002408 | antagonist, antibody, antisense oligonucleotide, blocker, cleavage, inhibitor, inhibitory allosteric modulator, inverse agonist, negative modulator, partial antagonist, suppressor |
| activates | RO:0002406 | activator, agonist, chaperone, cofactor, inducer, partial agonist, positive modulator, stimulator, vaccine |
| regulates | RO:0011002 | NA, None, n/a, other/unknown, adduct, allosteric modulator, binder, ligand, modulator, multitarget, potentiator, product of, substrate |
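The grouping in Table 1 lends itself to a simple lookup. The sketch below is illustrative only: the helper name `interaction_category` is hypothetical, and sending any type that is not listed under 'inhibits' or 'activates' to 'regulates' is an assumption made here rather than documented workflow behaviour.

```python
# Relation Ontology IDs for the three categories, as listed in Table 1.
CATEGORY_IDS = {
    "inhibits": "RO:0002408",
    "activates": "RO:0002406",
    "regulates": "RO:0011002",
}

# Directed interaction types, copied from Table 1.
INHIBITS = {
    "antagonist", "antibody", "antisense oligonucleotide", "blocker",
    "cleavage", "inhibitor", "inhibitory allosteric modulator",
    "inverse agonist", "negative modulator", "partial antagonist",
    "suppressor",
}
ACTIVATES = {
    "activator", "agonist", "chaperone", "cofactor", "inducer",
    "partial agonist", "positive modulator", "stimulator", "vaccine",
}


def interaction_category(interaction_type):
    """Hypothetical helper: map an interaction type to (category, RO ID)."""
    key = (interaction_type or "").strip().lower()
    if key in INHIBITS:
        category = "inhibits"
    elif key in ACTIVATES:
        category = "activates"
    else:
        # Undirected or missing types from Table 1; unlisted types also
        # fall through to this branch (assumption).
        category = "regulates"
    return category, CATEGORY_IDS[category]


print(interaction_category("Blocker"))
print(interaction_category(None))
```

Normalising case and whitespace before the lookup keeps the mapping robust to small formatting differences in the source records.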
The drug identifiers are from either the ChEMBL Database [17] or the Wikidata knowledge base [18]. There are various interaction types, such as ’activator’ and ’blocker’, but to improve our predictions, we used the interaction direction (inhibits or activates) instead of the interaction types themselves [14]. Because some relations lack direction, we introduced a third category called ’regulates’. Table 1 shows each interaction direction category along with the interaction types belonging to that category. We mapped these three interaction groups to URIs using the OBO Relations Ontology (RO) (https://www.ebi.ac.uk/ols/ontologies/ro), version 2022-05-23. To enable the biophysical drug repurposing approach, it is necessary to identify structurally similar drugs. In our workflow, we use the Tanimoto coefficient [19] to measure the similarity between drugs. To achieve this, we first retrieve the SMILES chemical structure notation for each drug in our graph using the BioThings MyGene.info API accessed with the biothings_client version v0.2.6 [16]. Subsequently, we convert the SMILES structures into RDKit molecule objects using the RDKit Python package version 2022.3.2 with the Chem module [20]. The RDKit molecule objects are then transformed into Morgan fingerprints using the GetMorganFingerprintAsBitVect() function of the AllChem RDKit module. Using the BulkTanimotoSimilarity() function of the DataStructs module from RDKit, we can calculate Tanimoto coefficients between every possible pair of drugs in the graph. This results in a large number of weighted edges, which can lead to a complex network. To mitigate this, we adopt a method by Thafar et al. [21], where all similarity scores are ranked in descending order and only the top-10 most similar drugs are retained, similar to the k-nearest neighbours algorithm.
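The similarity filtering described above can be illustrated with a small, dependency-free sketch. It does not call RDKit: fingerprints are represented as plain Python sets of on-bit indices, the function names are hypothetical, and retaining the top k neighbours per drug is one reading of the top-10 rule, assumed here for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_k_similar(fingerprints, k=10):
    """For each drug, keep its k most similar other drugs as (neighbour, score) pairs."""
    edges = {}
    for drug, fp in fingerprints.items():
        scores = [
            (other, tanimoto(fp, other_fp))
            for other, other_fp in fingerprints.items()
            if other != drug
        ]
        # Rank in descending order of similarity and keep only the top k.
        scores.sort(key=lambda pair: pair[1], reverse=True)
        edges[drug] = scores[:k]
    return edges


# Toy fingerprints (made-up bit indices, not real molecules).
toy = {
    "drugA": {1, 2, 3, 4},
    "drugB": {2, 3, 4, 5},
    "drugC": {7, 8},
}
neighbours = top_k_similar(toy, k=1)
print(neighbours["drugA"])
```

Keeping only k neighbours per drug bounds the number of similarity edges at k times the number of drugs, instead of growing quadratically with the number of drugs.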
Finally, we label all similarity edges with the ‘CHEMINF:000481’ ID, ‘http://semanticscience.org/resource/CHEMINF_000481’ URI, and the human readable string description ‘similar to’. The final graph is transformed into a Resource Description Framework (RDF) graph [22], where each entity, relationship, and entity class (gene, drug, etc.) is represented as an ontology term by its URI. As Monarch did not provide identifiers for the entity classes, we manually mapped the entity class labels to terms in ontologies and used their URIs. To achieve this, we utilized the Semanticscience Integrated Ontology (SIO) [23] (http://semanticscience.org/ontology/sio.owl) version 1.53 and obtained URIs using the URI resolution service identifiers.org (https://registry.identifiers.org/registry/sio, accessed June 2022).

Table 2
All entity groups in the HD chorea knowledge graph with their SIO identifiers and description (source: Semanticscience Integrated Ontology (SIO) [23]). Column ‘count’ shows the number of nodes in each entity group.

| entity | identifier | description | count |
|---|---|---|---|
| drug | SIO:010038 | A drug is a chemical substance that contains one or more active ingredients that regulate one or more biological processes. | 1352 |
| gene | SIO:010035 | A gene is part of a nucleic acid that contains all the necessary elements to encode a functional transcript. | 284 |
| disease | SIO:010299 | A disease is the outward manifestation of one or more disorders. | 194 |
| genotype | SIO:001079 | A genotype is a functional specification of a biological entity in terms of its genetic composition (or lack thereof). | 127 |
| variant | SIO:001381 | A genomic sequence variant is part of a nucleic acid which is compositionally different than another reference genomic part. | 106 |
| phenotype | SIO:010056 | A phenotype is an observable characteristic of an individual. | 71 |
| pathway | SIO:001107 | A pathway is an effective specification that outlines a set of actions that forms a way to achieve an objective. | 49 |
Table 2 presents the specific URIs we used for the entity classes. To perform this transformation, we extended the BioKnowledge reviewer by using the RDFLib Python package version 6.1.1 [24] and ensured that the graph was stored in Turtle format [25].

2.4. Graph embedding
To prepare the knowledge graph for embedding, we first remove all known drug-gene interactions from the graph. This is important to prevent bias in the prediction task, as keeping these edges would make the embedding vectors of the drugs very similar to the embedding vectors of the genes they interact with. We therefore split the graph into two separate graphs, one for drugs and one for genes. Each graph is then embedded separately using a graph embedding algorithm. After the embedding, the drug vectors are fused with the gene vectors to obtain drug-gene edges, which are used for training the machine learning model, as explained in more detail in the XGBoost prediction model section below. Celebi et al. [26] compared different knowledge graph embedding methods for drug-drug interaction prediction, and found that RDF2Vec with Skip-Gram generally outperforms other methods. Therefore, this workflow employs RDF2Vec for graph embedding. RDF2Vec adapts the language modelling approach of Word2Vec to RDF graph embeddings [26]. First, random walks are performed over the graph to generate sequences of entities and relations. Then, the Skip-Gram model is used to learn one embedding for each entity/relationship in the graph. After training, semantically and syntactically similar entities/relationships have similar embeddings [26]. For the prediction task in this study, only the drug and gene vectors are of interest, and therefore, only these vectors were selected for further computation. For RDF2Vec, the Python function RDF2VecTransformer from the rdf2vec module of the package pyrdf2vec version 0.2.3 is used [27].
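To make the walk-generation step concrete, here is a minimal sketch of random walks over a toy graph. It is not the pyrdf2vec implementation: the graph is a plain dictionary, the function name is hypothetical, and counting depth as the number of edges followed is an assumption. The defaults of 4 and 10 follow the depth and walk-count settings reported for this workflow.

```python
import random


def random_walks(graph, start, max_depth=4, max_walks=10, seed=0):
    """Generate max_walks random walks from `start`.

    graph maps a node to a list of (predicate, object) pairs. Each walk
    follows at most max_depth edges and is returned as an alternating
    sequence: node, predicate, node, ...
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(max_walks):
        walk = [start]
        node = start
        for _ in range(max_depth):
            out_edges = graph.get(node, [])
            if not out_edges:
                break  # dead end: stop this walk early
            predicate, node = rng.choice(out_edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


# Toy drug graph with made-up identifiers.
toy_graph = {
    "drug:A": [("similar_to", "drug:B")],
    "drug:B": [("similar_to", "drug:A"), ("similar_to", "drug:C")],
    "drug:C": [],
}
walks = random_walks(toy_graph, "drug:A")
print(walks[0])
```

Each walk plays the role of a sentence whose tokens are entities and relations, which is the form of input a Skip-Gram model expects.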
The maximum depth of one walk is set to 4 and for each entity in the graph, the maximum number of walks is set to 10.

2.5. Fused embeddings for link prediction
To train a supervised machine learning model, it is essential to have both positive and negative samples of data [26]. The positive samples are all known interactions (regulates, inhibits, or activates). The negative samples can be obtained from unknown interactions between drugs and genes. Edge embeddings for positive and negative samples are generated by adopting a node embedding fusion approach. For every possible drug-gene combination, we obtain one embedding by fusing the drug embedding and the gene embedding with the Hadamard operator, which is a strong operator for learning edge features in link prediction tasks [28]. We then add the class of the interaction (inhibits, activates, or regulates) to the resulting embedding. For edges that do not exist in the graph, we assign the label ”unknown” to represent the unknown interaction class. The prediction data for our machine learning model includes all unknown drug-gene interactions involving genes that contribute to the disease phenotype of interest. The negative samples of the training data are all unknown interactions that are not the prediction data. However, the number of negative samples significantly outweighs the number of positive samples. Including all of these negative samples could result in data imbalance and affect the performance of our model [26]. To address this issue, we decided to downsample the negative cases by randomly selecting negative samples with a sample size equal to the class in the positive set with the largest number of interactions (regulates, inhibits, or activates).

2.6. XGBoost prediction model
For our machine learning model, we utilized the XGBoost classifier proposed by Thafar et al. in 2021 [21]. To implement the model, we used the XGBClassifier() function from the Python package xgboost (version 1.3.3) [29].
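Before the classifier is configured, it may help to see how the training samples of Section 2.5 could be assembled. The sketch below is a simplified illustration under stated assumptions: embeddings are plain lists of floats, all function and variable names are hypothetical, and negatives are drawn with Python's random module rather than whatever sampler the workflow uses.

```python
import random


def hadamard(u, v):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(u, v)]


def build_samples(drug_vecs, gene_vecs, known, prediction_genes, seed=0):
    """Split all drug-gene pairs into training samples and prediction data.

    known maps (drug, gene) to 'regulates', 'inhibits' or 'activates'.
    Unknown pairs involving a gene of interest become prediction data;
    the remaining unknown pairs are candidate negatives.
    """
    positives, negatives, predict = [], [], []
    for drug, drug_vec in drug_vecs.items():
        for gene, gene_vec in gene_vecs.items():
            edge = hadamard(drug_vec, gene_vec)
            label = known.get((drug, gene))
            if label is not None:
                positives.append((edge, label))
            elif gene in prediction_genes:
                predict.append(((drug, gene), edge))
            else:
                negatives.append((edge, "unknown"))
    # Downsample negatives to the size of the largest positive class.
    counts = {}
    for _, label in positives:
        counts[label] = counts.get(label, 0) + 1
    n_neg = min(len(negatives), max(counts.values(), default=0))
    rng = random.Random(seed)
    train = positives + rng.sample(negatives, n_neg)
    return train, predict


# Toy two-dimensional embeddings with made-up values.
drug_vecs = {"d1": [1.0, 2.0], "d2": [0.5, -1.0]}
gene_vecs = {"g1": [3.0, 4.0], "g2": [2.0, 2.0]}
known = {("d1", "g1"): "inhibits", ("d1", "g2"): "regulates"}
train, predict = build_samples(drug_vecs, gene_vecs, known, {"g1"})
print(len(train), len(predict))
```

Because the Hadamard product is element-wise, each edge vector has the same dimension as the node embeddings, so the classifier input size does not grow when two embeddings are fused.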
We set the learning objective to ‘multi:softmax’, which allows XGBoost to optimize the likelihood of each class label and assign a probability to each possible class. To address minor class imbalance in our positive sample, we computed sample weights using the compute_sample_weight() function from sklearn [30] version 1.1.1. These weights are then used for training the model, which provides some bias towards the minority classes during training. To optimize the hyperparameters of our model, we conducted a randomized search on the search space presented in Table 3 using the RandomizedSearchCV() function from the model_selection module of the sklearn Python package (version 1.1.1) [30]. We set the number of parameter setting combinations to be tested (n_iter) to 20.

Table 3
The hyperparameter search space of the XGBoost classifier, with descriptions of each parameter. Source: https://xgboost.readthedocs.io/en/stable/parameter.html. uniform(a,b) indicates a uniform distribution on (a,b).

| parameter | description | search space |
|---|---|---|
| min_child_weight | Minimum sum of instance weight (hessian) needed in a child. | 2, 3, 5, 8, 13, 20, 30 |
| gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. | 0, 0.2, 0.5, 0.8, 1.2, 1.6, 2.5, 4, 6 |
| reg_alpha | L1 regularization term on weights. | 0, 0.5, 1, 3, 5, 10 |
| reg_lambda | L2 regularization term on weights. | 0, 0.5, 1, 3, 5, 10 |
| subsample | Subsample ratio of the training instances. | uniform(0.5, 1) |
| colsample_bytree | Subsample ratio of columns when constructing each tree. | uniform(0.2, 1) |
| max_depth | Maximum depth of a tree. | 4, 6, 8, 10, 12, 14, 16 |
| n_estimators | Number of boosting rounds. | 35, 45, 50, 70, 80, 90, 100 |
| learning_rate | Step size shrinkage used after each boosting step to prevent overfitting. | uniform(0, 0.3) |
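The sampling half of a randomized search over Table 3 can be sketched without any machine learning library. This only illustrates how 20 candidate settings might be drawn; it does not train or score models, which RandomizedSearchCV() handles in the workflow. The names `SEARCH_SPACE` and `sample_candidates` are hypothetical, and tuples are used here as an ad hoc encoding of uniform(a, b) ranges.

```python
import random

# Search space from Table 3: lists are discrete choices,
# tuples (a, b) stand for a uniform distribution on (a, b).
SEARCH_SPACE = {
    "min_child_weight": [2, 3, 5, 8, 13, 20, 30],
    "gamma": [0, 0.2, 0.5, 0.8, 1.2, 1.6, 2.5, 4, 6],
    "reg_alpha": [0, 0.5, 1, 3, 5, 10],
    "reg_lambda": [0, 0.5, 1, 3, 5, 10],
    "subsample": (0.5, 1.0),
    "colsample_bytree": (0.2, 1.0),
    "max_depth": [4, 6, 8, 10, 12, 14, 16],
    "n_estimators": [35, 45, 50, 70, 80, 90, 100],
    "learning_rate": (0.0, 0.3),
}


def sample_candidates(space, n_iter=20, seed=0):
    """Draw n_iter independent parameter settings from the search space."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        params = {}
        for name, spec in space.items():
            if isinstance(spec, tuple):
                params[name] = rng.uniform(*spec)  # continuous range
            else:
                params[name] = rng.choice(spec)    # discrete choice
        candidates.append(params)
    return candidates


candidates = sample_candidates(SEARCH_SPACE)
print(len(candidates))
```

In a full search, each candidate would be cross-validated and the best-scoring setting kept. With 20 draws, only a small fraction of the discrete combinations in Table 3 is explored, which is the usual trade-off of randomized search against an exhaustive grid.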
Given the challenge of identifying negative examples of drug-target pairs, as unlinked drugs and targets may simply represent drug-target pairs that have not been identified yet, we opted against conducting an error analysis. Instead, the model performance is assessed using the repeated stratified k-fold cross-validation technique alongside the F1-score metric. The number of subsets for the k-fold cross-validation is set to 10 and the number of repeats is set to 5. During each iteration of the process, the F1-score is calculated and averaged for each class. Finally, the average F1-score over all k iterations and number of repeats is computed to obtain the final evaluation metric. The best hyperparameters are used to build the final model. The confidence of each prediction is obtained using the predict_proba() function of xgboost version 1.3.3, which returns the probability of an interaction belonging to its predicted class.

2.7. DTI ranking and validation
For every gene in the graph that is associated with the symptom of interest, an interaction type and score is predicted for every drug in the dataset, provided that this interaction does not exist in the graph. To prioritize the most promising drug candidates for further investigation, we perform a ranking step based on the predicted positive interactions. First, we remove all predictions with a confidence score lower than 0.9 to focus only on the most confident predictions. Next, we rank the drugs based on the number of positive interactions they have with the genes associated with the symptom. This ranking approach is based on the hypothesis that drugs with more positive interactions with genes that cause a symptom are more likely to be effective in alleviating that symptom. In the case of drugs with the same number of interactions, we use the sum of prediction confidence scores as a secondary ranking criterion, with drugs having higher sums being ranked higher.

3. Results
Our workflow was initially run with the terms ”huntington’s disease” (OMIM number ‘143100’) and ”chorea” (‘HP:0002072’) as seeds representing the disease and symptom fields respectively, for constructing the knowledge graph. The graph was created on 2022-06-27 and has in total 2189 nodes and 17467 edges. Figure 2 provides an overview of all entities and relationships in the graph and Table 4 shows the identifiers and descriptions of each relation between nodes in this graph. The graph includes 1352 drugs and 284 genes, resulting in a total of 383,968 edge representations, of which 1753 are known (1301 regulates, 391 inhibits, and 61 activates). This graph has 200 genes that are associated with the symptom chorea, which are the genes of interest. These 200 genes yield 270,400 possible drug-gene pairs, of which 1077 are known drug-gene edges, so the prediction data consists of 269,323 unknown interactions of potential interest. To deal with the imbalance between the larger number of negative samples and the comparatively smaller number of positive samples, 1301 negative samples were randomly selected, matching the size of the largest positive class (regulates). The best XGBoost hyperparameters can be found in Table 5, and the F1 score with these hyperparameters is 0.867. The trained XGBoost model was used to predict the classes of the unknown drug-gene interactions of interest. Table 6 shows the predicted top ten ranked drugs that interact with genes that are associated with the phenotype chorea. We manually explored the top ranked predictions for HD that are associated with chorea. Below we present the top three candidates. Table 7 presents the two highest ranked drugs and the genes that these drugs have a positive predicted interaction with. The top predicted ranked drug is CHEMBL29097. CHEMBL29097 (synonym MK-886) is an inhibitor of 5-lipoxygenase-activating protein activity, currently in preclinical phase.
It has been found that 10 µM MK-886 can abolish the biosynthetic production of cysteinyl leukotrienes (CysLTs), which are suggested to be involved in brain inflammation and neurological diseases [33]. In addition to its anti-inflammatory activity, MK-886 has been shown to activate the proteasome, which is known to have a causative role in HD [34]. Impaired function of the proteasome leads to the formation of intracellular aggregates in the nucleus, as the proteasome cannot efficiently clear misfolded huntingtin proteins [35]. The second highest ranked drug is baicalein. Baicalein (CHEMBL8260) is a flavonoid isolated from the traditional Chinese medicinal herb Scutellaria baicalensis Georgi, currently in Phase 2. Baicalein has known anti-inflammatory and neuroprotective efficacy in neurodegenerative disease models [36]. Rui et al. [36] studied the effects of baicalein on inflammasome-induced neuroinflammation in Parkinson’s disease (PD) and found that baicalein can suppress MPTP-induced nigral dopaminergic neuron death, glial activation, and motor dysfunction in mice by suppressing the NLRP3/caspase-1/GSDMD pathway. In addition, several studies have demonstrated that baicalein protects neurons in animal models of Alzheimer’s disease (AD) and PD by inhibiting neuroinflammation [37]. Amphotericin b (CHEMBL267345) was another drug on our list that ranked very high. Amphotericin b is an approved antifungal drug used to treat serious fungal infections. Experimental evidence shows that some antibiotics and antifungal medication have neuroprotective action through anti-aggregating activity on disease-associated proteins [38]. Although this drug has been shown to cause a delay in the formation of amyloid-𝛽, it was also found to induce toxicity [39]. However, Soler et al. [40] developed a derivative of amphotericin that has anti-aggregating action but lacks toxicity and antimicrobial activity [38].

3.1. Other rare diseases
To demonstrate the reusability of our approach, we also applied our methodology to another rare disease that currently lacks treatment: Spinocerebellar ataxia type 1 (SCA1). We used as seeds the terms ”SCA1” (OMIM number ’164400’) and the symptom ”hyporeflexia” (HP:0001265) to run our drug repurposing workflow, and below we describe a few of the top hits. The first drug prioritized by our workflow was Dovitinib (CHEMBL522892), currently in phase 3. Dovitinib is a pan receptor tyrosine kinase (RTK) inhibitor that has anti-tumor activity in preclinical models of several cancers [41]. It has recently been suggested as a candidate treatment for AD because it normalizes 𝛽 amyloid mediated transcriptional responses by targeting the CREB3L2-ATF4 heterodimerization, which is responsible for the majority of the transcriptional changes occurring in AD neurons [42]. Its well tolerated safety profile and its ability to cross the blood brain barrier [42] make it an interesting candidate for AD but also potentially for other neurodegenerative disorders that exhibit disease mediated changes similar to those in AD. The second predicted drug on the list was broquinaldol (CHEMBL1394319), a small molecule that has antifungal and antibacterial activity. This is an investigational drug that was found to have activity against thyroid cancer in a high throughput screening experiment [43]. However, there is currently no evidence that it is associated with neurodegenerative diseases. Number three on the candidate drug list for SCA1 was an interesting compound, astemizole (CHEMBL296419). Astemizole is an approved second generation antihistamine drug [44] that has been found to rescue motor phenotype in a Drosophila model of PD [45].

4. Discussion
This work presents a novel disease-drug profiling approach to identifying potential candidate compounds that could alleviate the symptoms of a rare disease.
It combines two established and widely accepted approaches: mining rare disease-specific data from multiple public databases into a knowledge graph [11], and graph-based machine learning for identifying drug-target interactions [21]. The result is an automated workflow which makes disease-drug profile predictions targeted to specific rare diseases. We demonstrated its utility using predictions from Huntington’s disease and SCA1. The advance that this work provides to the field is the use of rare disease specific knowledge graphs. Using the BioKnowledge reviewer in an automated drug repurposing workflow enables learning from a comprehensive view of the underlying druggable rare disease biology and pathogenesis of interest. This is advantageous over current integration methods used in rare disease research, which use information about thousands of complex disorders [2], because it leverages knowledge for precision medicine. Another advantage is that, compared to existing solutions [8, 9, 10], our method harnesses heterogeneous and expressive semantic graphs for DTI prediction beyond bipartite networks. Integrating new types of entities with Semantic Web technologies enables us to represent more complex relations around drugs and targets, and it opens the possibility of learning from them and exploiting the semantics by means of graph embedding methods such as RDF2Vec. Through sophisticated graph-based algorithms, we can traverse the knowledge graph to identify patterns in the data and predict potential new relationships between drugs and diseases. We demonstrated that knowledge graphs and graph machine learning, streamlined in an automated workflow, give testable DTI hypotheses for drug repurposing in the rare disease area. This can support researchers in systematically generating compound prioritizations coupled with well-designed validation experiments to discover treatments for rare diseases in a timely and cost-effective manner.
One limitation is that we did not integrate domain expert knowledge on disease pathobiology with patient data, which can be the basis for highly innovative drugs. While [11] gave a solution to include expert knowledge in graphs, access to patient data is a serious problem in health research. However, projects such as the EJP-RD¹ are providing Semantic Web based solutions for patient data sharing. Our results provide some interesting candidates that could potentially be of great value for the rare disease community. Some of our prioritized drugs are already associated with other neurodegenerative disorders (AD and PD), targeting neuroinflammation, which is a hallmark of the HD pathology. Other candidates (broquinaldol and amphotericin b) belong to the class of antifungal and antibacterial medication. These types of drugs have drawn a lot of attention, and although they are mainly used to treat infections, new applications are being discovered. It has been reported that antibiotics, for example doxycycline and minocycline, have neuroprotective effects due to their anti-inflammatory properties [46]. The workflow is presented as a web application and yields promising results in a reusable and reproducible way for the rare diseases community. In the future, it could be extended and improved by the addition of experimentally validated negative interactions from reliable databases. Our current approach uses unknown drug-gene pairs as negative samples for classification. It is therefore possible that this set includes currently unknown positive interactions, which may adversely affect model training [47]. Additionally, the knowledge graph could be extended to include further information about each drug, such as side-effects and drug interactions, and additional input seeds could be obtained from the Monarch database, such as genes and metabolites associated with the particular disease, or related diseases.
Moreover, integrating the predicted embeddings into the knowledge graph would enable the evaluation of performance across diverse prediction methods, offering valuable insights into model efficacy and versatility. Lastly, it is important to note that the RDF graphs are stored in Turtle format at the location where the code is executed, to enable further use in code execution and analysis. The RDF graph is currently not served as a live RDF/SPARQL endpoint, which presents an opportunity for improvement in our approach. Providing such an endpoint would align with broader efforts in the field to enhance the transparency and accessibility of machine learning outcomes.

¹ https://www.ejprarediseases.org/

5. Conclusion

Combining data integration into knowledge graphs with state-of-the-art graph-based machine learning methods results in a novel automated drug repurposing workflow, specifically suited for rare diseases, where data tends to be sparse and distributed. The workflow produces a ranked list of candidate compounds, which serve as new hypotheses for drug treatments. The drug repurposing workflow generalises to any rare disease, but as a proof-of-concept, we focused on Huntington’s disease and a related condition, SCA1. We identified several promising candidate drugs for the symptom chorea in Huntington’s disease, demonstrating the potential of our approach. With further testing and validation, these candidates could be explored as potential treatments for the disease. The workflow is provided as a web application, in a publicly available Docker container. It is therefore accessible to researchers without technical expertise and is a reusable and reproducible application for the rare disease community.

References

[1] S. Shah, M. M. Dooms, S. Amaral-Garcia, M. Igoillo-Esteve, Current drug repurposing strategies for rare neurodegenerative disorders, Frontiers in Pharmacology 12 (2021). doi:10.3389/fphar.2021.768023.
[2] H. I. Roessler, N. V. Knoers, M. M. van Haelst, G.
van Haaften, Drug repurposing for rare diseases, Trends in Pharmacological Sciences 42 (2021) 255–267. doi:10.1016/j.tips.2021.01.003.
[3] T. B. Malas, W. J. Vlietstra, R. Kudrin, S. Starikov, M. Charrout, M. Roos, D. J. M. Peters, J. A. Kors, R. Vos, P. A. C. ’t Hoen, E. M. van Mulligen, K. M. Hettne, Drug prioritization using the semantic properties of a knowledge graph, Scientific Reports 9 (2019). doi:10.1038/s41598-019-42806-6.
[4] M. D. Paranjpe, A. Taubes, M. Sirota, Insights into computational drug repurposing for neurodegenerative disease, Trends in Pharmacological Sciences 40 (2019) 565–576. doi:10.1016/j.tips.2019.06.003.
[5] W. Ba-alawi, O. Soufan, M. Essack, P. Kalnis, V. B. Bajic, Daspfind: new efficient method to predict drug–target interactions, Journal of Cheminformatics 8 (2016) 1758–2946. doi:10.1186/s13321-016-0128-4.
[6] Y. Luo, X. Zhao, J. Zhou, J. Yang, Y. Zhang, W. Kuang, J. Peng, L. Chen, J. Zeng, A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information, Nature Communications 8 (2017) 2041–1723. doi:10.1038/s41467-017-00680-8.
[7] C. Chen, H. Shi, Z. Jiang, A. Salhi, R. Chen, X. Cui, B. Yu, Dnn-dtis: Improved drug-target interactions prediction using xgboost feature selection and deep neural network, Computers in Biology and Medicine 136 (2021) 104676. doi:10.1016/j.compbiomed.2021.104676.
[8] K. Huang, T. Fu, L. M. Glass, M. Zitnik, C. Xiao, J. Sun, DeepPurpose: a deep learning library for drug–target interaction prediction, Bioinformatics 36 (2020) 5545–5547. doi:10.1093/bioinformatics/btaa1005.
[9] Y. Kalakoti, S. Yadav, D. Sundar, Deep neural network-assisted drug recommendation systems for identifying potential drug–target interactions, American Chemical Society 7 (2022) 12138–12146. doi:10.1021/acsomega.2c00424.
[10] E. Amiri Souri, R. Laddach, S. N. Karagiannis, L. G. Papageorgiou, S.
Tsoka, Novel drug-target interactions via link prediction and network embedding, BMC Bioinformatics 23 (2022) 1471–2105. doi:10.1186/s12859-022-04650-w.
[11] N. Queralt-Rosinach, G. S. Stupp, T. S. Li, M. Mayers, M. E. Hoatlin, M. Might, B. M. Good, A. I. Su, Structured reviews for data and knowledge-driven research, Database 2020 (2020). doi:10.1093/database/baaa015.
[12] R. A. Roos, Huntington's disease: a clinical review, Orphanet Journal of Rare Diseases 5 (2010). doi:10.1186/1750-1172-5-40.
[13] C. J. Mungall, J. A. McMurry, S. Köhler, J. P. Balhoff, C. Borromeo, M. Brush, S. Carbon, T. Conlin, N. Dunn, M. Engelstad, E. Foster, J. Gourdine, J. O. Jacobsen, D. Keith, B. Laraway, S. E. Lewis, J. NguyenXuan, K. Shefchek, N. Vasilevsky, Z. Yuan, N. Washington, H. Hochheiser, T. Groza, D. Smedley, P. N. Robinson, M. A. Haendel, The monarch initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species, Nucleic Acids Research 45 (2016) D712–D722. doi:10.1093/nar/gkw1128.
[14] S. L. Freshour, S. Kiwala, K. C. Cotto, A. C. Coffman, J. F. McMichael, J. J. Song, M. Griffith, O. L. Griffith, A. H. Wagner, Integration of the drug–gene interaction database (DGIdb 4.0) with open crowdsource efforts, Nucleic Acids Research 49 (2020) D1144–D1151. doi:10.1093/nar/gkaa1084.
[15] D. Maglott, J. Ostell, K. D. Pruitt, T. Tatusova, Entrez gene: gene-centered information at NCBI, Nucleic Acids Research 35 (2007) D26–D31. doi:10.1093/nar/gkl993.
[16] C. Wu, biothings client, 2022. URL: https://github.com/biothings/biothings_client.py.
[17] D. Mendez, A. Gaulton, A. P. Bento, J. Chambers, M. De Veij, E. Félix, M. P. Magariños, J. F. Mosquera, P. Mutowo, M. Nowotka, M. Gordillo-Marañón, F. Hunter, L. Junco, G. Mugumbate, M. Rodriguez-Lopez, F. Atkinson, N. Bosc, C. J. Radoux, A. Segura-Cabrera, A. Hersey, A. R. Leach, ChEMBL: towards direct deposition of bioassay data, Nucleic Acids Res. 47 (2019) D930–D940.
doi:10.1093/nar/gky1075.
[18] Wikimedia Foundation, Wikidata, 2022. URL: https://www.wikidata.org/wiki/Wikidata:Main_Page.
[19] A. Kumar, Chemical similarity methods: A tutorial review, The Chemical Educator (2011) 46–50. doi:10.1333/s00897112344a.
[20] Rdkit: Open-source cheminformatics, 2022. URL: http://www.rdkit.org.
[21] M. A. Thafar, R. S. Olayan, S. Albaradei, V. B. Bajic, T. Gojobori, M. Essack, X. Gao, DTi2vec: Drug–target interaction prediction using network embedding and ensemble learning, Journal of Cheminformatics 13 (2021). doi:10.1186/s13321-021-00552-w.
[22] World Wide Web Consortium, Resource Description Framework (RDF): Concepts and Abstract Syntax, 2014. URL: https://www.w3.org/TR/rdf11-concepts/, W3C Recommendation.
[23] M. Dumontier, C. J. Baker, J. Baran, A. Callahan, L. Chepelev, J. Cruz-Toledo, N. R. Del Rio, G. Duck, L. I. Furlong, N. Keath, D. Klassen, J. P. McCusker, N. Queralt-Rosinach, M. Samwald, N. Villanueva-Rosales, M. D. Wilkinson, R. Hoehndorf, The semanticscience integrated ontology (SIO) for biomedical research and knowledge discovery, J. Biomed. Semantics 5 (2014) 14. doi:10.1186/2041-1480-5-14.
[24] I. Aucamp, Rdflib, 2021. URL: https://github.com/RDFLib/rdflib.
[25] RDF 1.1 turtle, 2014. URL: https://www.w3.org/TR/turtle/.
[26] R. Celebi, H. Uyar, E. Yasar, O. Gumus, O. Dikenelli, M. Dumontier, Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings, BMC Bioinformatics 20 (2019). doi:10.1186/s12859-019-3284-5.
[27] G. Vandewiele, B. Steenwinckel, T. Agozzino, M. Weyns, P. Bonte, F. Ongenae, F. D. Turck, pyRDF2Vec: Python Implementation and Extension of RDF2Vec, IDLab, 2020. URL: https://github.com/IBCNServices/pyRDF2Vec.
[28] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, 2016. doi:10.48550/ARXIV.1607.00653.
[29] T. Chen, C.
Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, Édouard Duchesnay, Scikit-learn: Machine learning in python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.
[31] C. Mungall, J. A. Overton, D. Osumi-Sutherland, M. Haendel, Mbrush, Obo-relations: 2015-10-29 release, 2015. doi:10.5281/ZENODO.32899.
[32] M. Brush, N. Matentzoglu, M. Haendel, Geno-ontology, 2022. URL: https://github.com/monarch-initiative/GENO-ontology.
[33] P. Ballerini, P. D. Iorio, R. Ciccarelli, F. Caciagli, A. Polp, A. Beraudi, S. Buccella, I. D'Alimonte, M. D'Auro, E. Nargi, P. Patricelli, D. Visini, U. Traversa, P2y1 and cysteinyl leukotriene receptors mediate purine and cysteinyl leukotriene co-release in primary cultures of rat microglia, International Journal of Immunopathology and Pharmacology 18 (2005) 255–268. doi:10.1177/039463200501800208.
[34] E. E. Liao, M. Yang, N. Nathan Kochen, N. Vunnam, A. R. Braun, D. M. Ferguson, J. N. Sachs, Proteasomal stimulation by mk886 and its derivatives can rescue tau-induced neurite pathology, Molecular neurobiology 60 (2023) 6133–6144. doi:10.1007/s12035-023-03417-5.
[35] T. R. Soares, S. D. Reis, B. R. Pinho, M. R. Duchen, J. M. Oliveira, Targeting the proteostasis network in huntington’s disease, Ageing Research Reviews 49 (2019) 92–103. doi:10.1016/j.arr.2018.11.006.
[36] W. Rui, S. Li, H. Xiao, M. Xiao, J. Shi, Baicalein attenuates neuroinflammation by inhibiting NLRP3/caspase-1/GSDMD pathway in MPTP induced mice model of parkinson’s disease, Int. J. Neuropsychopharmacol. 23 (2020) 762–773.
doi:10.1093/ijnp/pyaa060.
[37] Y. Li, J. Zhao, C. Hölscher, Therapeutic potential of baicalein in alzheimer’s disease and parkinson’s disease, CNS drugs 31 (2017) 639–652. doi:10.1007/s40263-017-0451-y.
[38] S. B. Socias, F. González-Lizárraga, C. L. Avila, C. Vera, L. Acuña, J. E. Sepulveda-Diaz, E. Del-Bel, R. Raisman-Vozari, R. N. Chehin, Exploiting the therapeutic potential of ready-to-use drugs: Repurposing antibiotics against amyloid aggregation in neurodegenerative diseases, Progress in Neurobiology 162 (2018) 17–36. doi:10.1016/j.pneurobio.2017.12.002.
[39] F. Durães, M. Pinto, E. Sousa, Old drugs as new treatments for neurodegenerative diseases, Pharmaceuticals 11 (2018). doi:10.3390/ph11020044.
[40] L. Soler, P. Caffrey, H. E. McMahon, Effects of new amphotericin analogues on the scrapie isoform of the prion protein, Biochimica et Biophysica Acta (BBA) - General Subjects 1780 (2008) 1162–1167. doi:10.1016/j.bbagen.2008.07.005.
[41] S. S. Yadav, J. Li, J. A. Stockert, B. Herzog, J. O’Connor, L. Garzon-Manco, R. Parsons, A. K. Tewari, K. K. Yadav, Induction of neuroendocrine differentiation in prostate cancer cells by dovitinib (tki-258) and its therapeutic implications, Translational Oncology 10 (2017) 357–366. doi:10.1016/j.tranon.2017.01.011.
[42] C. G. Roque, K. M. Chung, E. P. McCurdy, R. Jagannathan, L. K. Randolph, K. Herline-Killian, J. Baleriola, U. Hengst, Creb3l2-atf4 heterodimerization defines a transcriptional hub of alzheimer’s disease gene expression linked to neuropathology, Science Advances 9 (2023) eadd2671. doi:10.1126/sciadv.add2671.
[43] L. Zhang, M. He, Y. Zhang, N. Nilubol, M. Shen, E. Kebebew, Quantitative High-Throughput Drug Screening Identifies Novel Classes of Drugs with Anticancer Activity in Thyroid Cancer Cells: Opportunities for Repurposing, The Journal of Clinical Endocrinology & Metabolism 97 (2012) E319–E328. doi:10.1210/jc.2011-2671.
URL: https://academic.oup.com/jcem/article-pdf/97/3/E319/10416587/jcemE319.pdf.
[44] P. M. Krstenansky, J. Robert J. Cluxton, Astemizole: A long-acting, nonsedating antihistamine, Drug Intelligence & Clinical Pharmacy 21 (1987) 947–953. doi:10.1177/106002808702101202.
[45] K. Styczyńska-Soczka, L. Zechini, L. Zografos, Validating the predicted effect of astemizole and ketoconazole using a drosophila model of parkinson’s disease, Assay and drug development technologies 15 (2017) 106–112.
[46] A. Dominguez-Meijide, V. Parrales, E. Vasili, F. González-Lizárraga, A. König, D. F. Lázaro, A. Lannuzel, S. Haik, E. Del Bel, R. Chehín, R. Raisman-Vozari, P. P. Michel, N. Bizat, T. F. Outeiro, Doxycycline inhibits α-synuclein-associated pathologies in vitro and in vivo, Neurobiology of Disease 151 (2021) 105256. doi:10.1016/j.nbd.2021.105256.
[47] L. Xu, X. Ru, R. Song, Application of machine learning for drug–target interaction prediction, Frontiers in Genetics 12 (2021). URL: https://doi.org/10.3389/fgene.2021.680117. doi:10.3389/fgene.2021.680117.

Figure 1: The drug repurposing workflow. (1.) It takes as input a disease and symptom of interest, then constructs the knowledge graph using Monarch and DGIdb, adds drug-drug similarity edges based on SMILES compound structure, converts the graph into an RDF graph and removes drug-gene links in preparation for embedding. (2.) It then applies RDF2Vec, which generates random walks over the graph for each entity, trains a skip-gram model and outputs one feature vector for each entity in the graph. (3.) Next, it generates edge representations for each drug-gene pair and divides them into training and prediction data. (4.) Finally, it trains an XGBoost classification model using the training data, finds the best model by hyperparameter tuning with randomized search, evaluates it using repeated stratified 10-fold CV and uses the best model found to predict the interactions of interest.
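Step (3.) of the workflow can be illustrated with a minimal sketch that turns two node embeddings into a single edge representation. The element-wise (Hadamard) product used here is one common operator for this purpose and is an assumption; the operator actually used by the workflow is not specified in this caption, and concatenation or averaging are possible alternatives. The embedding values and identifiers are made up for illustration.

```python
def hadamard(u, v):
    """Element-wise product of two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(u, v)]


def edge_features(pairs, embeddings):
    """Build one feature vector per (drug, gene) pair from node embeddings."""
    return {
        (drug, gene): hadamard(embeddings[drug], embeddings[gene])
        for drug, gene in pairs
    }


# Hypothetical 3-dimensional node embeddings (real embeddings are much larger).
embeddings = {
    "drug:A": [0.5, -1.0, 2.0],
    "drug:B": [1.0, 0.0, -0.5],
    "gene:1": [2.0, 0.5, 0.25],
}

features = edge_features([("drug:A", "gene:1"), ("drug:B", "gene:1")], embeddings)
for pair, vector in features.items():
    print(pair, vector)
```

Vectors of this kind, paired with interaction labels, form the training data for the classifier in step (4.), while vectors for the drug-gene pairs of interest form the prediction data.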
Figure 2: Data model of the HD chorea graph. Overview of all entities and relations in the HD chorea knowledge graph.

Table 4
All relations in the HD chorea KG with their identifiers and description (sources: OBO Relations Ontology [31]; GENO ontology [32]). Column ‘count’ shows the number of edges in each relation group.

| relation | identifier | description | count |
|---|---|---|---|
| similar to | CHEMINF:000481 | Connects a molecular entity that is deemed similar to another according to some algorithm. | 13023 |
| regulates | RO:0011002 | The entity 𝑥 has an activity that regulates an activity of the entity 𝑦. | 1301 |
| has phenotype | RO:0002200 | A relationship that holds between a biological entity and a phenotype. Here a phenotype is construed broadly as any kind of quality of an organism part, a collection of these qualities, or a change in quality or qualities. The subject of this relationship can be an organism, a genomic entity such as a gene or genotype, or a condition such as a disease. | 1016 |
| interacts with | RO:0002434 | A relationship that holds between two entities in which the processes executed by the two entities are causally connected. | 900 |
| inhibits | RO:0002408 | Directly negatively regulates. | 391 |
| causes condition | RO:0003303 | A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a condition (a phenotype or disease), where the entity has some causal role for the condition. | 212 |
| contributes to condition | RO:0003304 | A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a condition (a phenotype or disease), where the entity has some contributing role that influences the condition. | 107 |
| has genotype | GENO:0000222 | A relationship that holds between a biological entity and some level of genetic variation present in its genome. | 106 |
| has role in modelling | RO:0003301 | A relation between a biological, experimental, or computational artefact and an entity it is used to study, in virtue of its replicating or approximating features of the studied entity. | 103 |
| correlated with | RO:0002610 | A relationship that holds between two entities, where the entities exhibit a statistical dependence relationship. The entities may be statistical variables, or they may be other kinds of entities such as diseases, chemical entities or processes. | 72 |
| activates | RO:0002406 | Directly positively regulates. | 61 |
| involved in | RO:0002331 | 𝑥 is involved in 𝑦 if and only if 𝑥 enables some process 𝑦′, and 𝑦′ is part of 𝑦. | |
| enables | RO:0002327 | catalyses. | 49 |
| colocalises with | RO:0002325 | 𝑥 colocalises with 𝑦 if and only if 𝑥 is transiently or peripherally associated with 𝑦. | 38 |
| is allele of | GENO:0000408 | A relation linking an instance of a variable feature (aka an allele) to a genomic location/locus it occupies. This is typically a gene locus, but a feature may be an allele of other types of named loci such as QTLs, or alleles of some unnamed locus of arbitrary size. | 39 |
| has affected feature | GENO:0000418 | A relation that holds between an instance of a genetic variation and a genomic feature (typically a gene class) that is affected in its sequence or expression. | 22 |
| is causal loss of function germline mutation of | RO:0004012 | Relates a gene to a condition, such that a mutation in this gene in a germ cell impairs the function of the corresponding product and that is sufficient to produce the condition and that can be passed on to offspring. | 15 |
| in 1 to 1 orthology relationship with | RO:HOM0000020 | Orthology that involves two genes that did not experience any duplication after the speciation event that created them. | 10 |
| is marker for | RO:0002607 | 𝑥 is marker for 𝑦 if the presence or occurrence of 𝑦 is correlated with the presence or occurrence of 𝑥, and the observation of 𝑥 is used to infer the presence or occurrence of 𝑦. Note that this does not imply that 𝑥 and 𝑦 are in a direct causal relationship, as it may be the case that there is a third entity 𝑧 that stands in a direct causal relationship with 𝑥 and 𝑦. | 1 |
| is causal gain of function germline mutation of | RO:0004011 | Relates a gene to a condition, such that a mutation in this gene in a germ cell provides a new function of the corresponding product and that is sufficient to produce the condition and that can be passed on to offspring. | 1 |

Table 5
The best found hyperparameters for the XGBoost model for the HD chorea graph.

| parameter | best |
|---|---|
| min_child_weight | 5 |
| gamma | 0.5 |
| reg_alpha | 0.5 |
| reg_lambda | 3 |
| colsample_bytree | 0.8053 |
| max_depth | 10 |
| n_estimators | 50 |
| learning_rate | 0.1258 |

Table 6
The top ten ranked drugs for HD chorea.

| URI | name |
|---|---|
| https://identifiers.org/chembl:CHEMBL29097 | CHEMBL29097 |
| https://identifiers.org/chembl:CHEMBL8260 | BAICALEIN |
| https://identifiers.org/chembl:CHEMBL221137 | EMBELIN |
| https://identifiers.org/chembl:CHEMBL267345 | AMPHOTERICIN B |
| https://identifiers.org/chembl:CHEMBL308688 | 5,7-DIMETHOXYISOFLAVONE |
| https://identifiers.org/chembl:CHEMBL2110660 | IGMESINE |
| https://identifiers.org/chembl:CHEMBL275809 | FR-122047 |
| https://identifiers.org/chembl:CHEMBL161343 | ARACHIDONOYL GLYCINE |
| https://identifiers.org/chembl:CHEMBL585 | TRIAMTERENE |
| https://identifiers.org/chembl:CHEMBL1269845 | CHEMBL1269845 |

Table 7
The two highest ranked drugs for HD chorea with the genes they interact with, the interaction types and prediction confidence.
| drug ID | drug label | gene ID | gene label | interaction type | confidence |
|---|---|---|---|---|---|
| chembl:CHEMBL29097 | CHEMBL29097 | HGNC:10555 | ATXN2 | regulates | 0.990 |
| | | HGNC:10596 | SCN8A | inhibits | 0.963 |
| | | HGNC:4572 | GRIA2 | inhibits | 0.934 |
| | | HGNC:29259 | TAOK1 | regulates | 0.955 |
| | | HGNC:1461 | CAMK2B | regulates | 0.935 |
| | | HGNC:4235 | GFAP (human) | regulates | 0.935 |
| | | HGNC:713 | ARSA | regulates | 0.969 |
| | | HGNC:4076 | GABRA2 | activates | 0.908 |
| | | HGNC:4580 | GRIK2 | inhibits | 0.974 |
| | | HGNC:11005 | SLC2A1 | regulates | 0.914 |
| | | HGNC:30035 | PIK3R5 | inhibits | 0.981 |
| chembl:CHEMBL8260 | BAICALEIN | HGNC:10555 | ATXN2 | regulates | 0.988 |
| | | HGNC:10596 | SCN8A | inhibits | 0.979 |
| | | HGNC:4572 | GRIA2 | inhibits | 0.946 |
| | | HGNC:29259 | TAOK1 | regulates | 0.963 |
| | | HGNC:4584 | GRIN1 | inhibits | 0.911 |
| | | HGNC:1461 | CAMK2B | regulates | 0.918 |
| | | HGNC:4235 | GFAP (human) | regulates | 0.905 |
| | | HGNC:713 | ARSA | regulates | 0.960 |
| | | HGNC:4580 | GRIK2 | inhibits | 0.971 |
| | | HGNC:2295 | CP (human) | regulates | 0.915 |
| | | HGNC:30035 | PIK3R5 | inhibits | 0.984 |
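Predictions of the kind listed in Table 7 can be summarised per drug with a few lines of code. The sketch below uses a subset of the Table 7 rows; the confidence threshold of 0.975 and the ordering rule (number of retained genes, then highest confidence) are hypothetical choices for illustration and are not the ranking criterion used to produce Table 6.

```python
from collections import defaultdict

# (drug label, gene label, confidence): a subset of rows copied from Table 7.
PREDICTIONS = [
    ("CHEMBL29097", "ATXN2", 0.990),
    ("CHEMBL29097", "PIK3R5", 0.981),
    ("CHEMBL29097", "GRIK2", 0.974),
    ("BAICALEIN", "ATXN2", 0.988),
    ("BAICALEIN", "PIK3R5", 0.984),
    ("BAICALEIN", "GRIK2", 0.971),
]


def summarise(predictions, threshold):
    """Group predictions by drug, keeping those at or above the threshold."""
    by_drug = defaultdict(list)
    for drug, gene, confidence in predictions:
        if confidence >= threshold:
            by_drug[drug].append((gene, confidence))
    summary = []
    for drug, hits in by_drug.items():
        top_gene, top_confidence = max(hits, key=lambda hit: hit[1])
        summary.append({
            "drug": drug,
            "n_genes": len(hits),
            "top_gene": top_gene,
            "top_confidence": top_confidence,
        })
    # Hypothetical ordering: more retained genes first, then higher confidence.
    summary.sort(key=lambda row: (-row["n_genes"], -row["top_confidence"]))
    return summary


summary = summarise(PREDICTIONS, threshold=0.975)
for row in summary:
    print(row)
```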