=Paper= {{Paper |id=Vol-2042/paper12 |storemode=property |title= Semantic Knowledge Graph Network Features for Drug Repurposing |pdfUrl=https://ceur-ws.org/Vol-2042/paper12.pdf |volume=Vol-2042 |authors=Tareq B. Malas,Roman Kudrin,Sergei Starikov,Dorien Peters,Marco Roos,Peter A.C. 't Hoen,Kristina Hettne |dblpUrl=https://dblp.org/rec/conf/swat4ls/MalasKSPRHH17 }} == Semantic Knowledge Graph Network Features for Drug Repurposing== https://ceur-ws.org/Vol-2042/paper12.pdf

Semantic Knowledge Graph Network Features for Drug
Repurposing
Tareq B. Malas1, Roman Kudrin1,2, Sergei Starikov1,2, Peter A.C. ́ʼt Hoen1, Dorien
J.M. Peters1, Marco Roos1, Kristina M. Hettne1
1 Department of Human Genetics, Leiden University Medical Center, 2300 RC Leiden, The

Netherlands
2 Faculty of Bioengineering and Bioinformatics, Moscow State University, 119234 Moscow,

Russia

Abstract. Given the significant time and financial costs of developing a commer-
cial drug, it remains important to constantly reform the drug discovery pipeline
with novel technologies that can narrow the candidates down to the most prom-
ising lead compounds for clinical testing. Computational approaches are used to
expedite the drug discovery processes. Semantic knowledge graphs can assist
these computational approaches, because they connect different biological data-
bases and reflect the relationships between genes, pathways and diseases. Here,
we took advantage of the Euretos Knowledge Platform (EKP), a commercial da-
tabase that integrates more than 170 different biological resources including
DrugBank, and evaluated the usefulness of the underlying semantic knowledge
graphs to predict novel drug-disease associations. We extracted network-based
features from the semantic knowledge graph and tested their ability to separate
between the positive and negative data sets of drug-disease pairs. Our results
showed that the extracted features such as the total number of intermediate con-
cepts (count), the number of different semantic categories (diversity), and the
predicates connecting a drug-disease pair were successful in separating the posi-
tive from the negative sets. These features provide a proof of concept for using
semantic knowledge graphs for drug repurposing efforts. Our work reveals the
added value of integrating different biological databases for solving complex bi-
ological questions.

Keywords: Drug repurposing, drug discovery, Semantic graphs, network min-
ing, machine learning.

1 Introduction

In silico methodologies are becoming more important in the modern-day drug discov-
ery pipeline. Computational drug discovery techniques accelerated the identification of
drug targets and significantly contributed to the different stages of drug development
[1]. Most efforts are concentrated into developing methods for the prediction of drug-
target interactions that mitigate the expensive costs of experimental drug development
2

and optimization [2]. Moreover, these methods are allowing for drug repurposing ef-
forts that identify new therapeutic applications for existing drugs and reduce research
cost and time due to the existing extensive clinical studies [2, 3].
Given that the majority of diseases cannot be explained by single-gene defects but
by the coordinated functions of their complex gene networks, drug development needs
to shift its attention towards understanding network-based perspectives of disease
mechanisms. Network-based approaches are providing important insights into the rela-
tionship between drugs and diseases. An investigation into the interaction between drug
targets and disease genes revealed that they are not closely related [4]. Additionally,
network-based approaches are showing promise in predicting novel targets and new
uses for existing drugs [5]. Current network-based approaches rely on drug target pro-
file similarities. These similarities are defined by either the number of targets two drugs
share or the shortest paths between their interactomes. However, these studies focus
only on using a limited number of databases related to protein drug targets, leaving a
large amount of rich data untapped.
Semantic and text-mining approaches that screen hundreds of thousands of pub-
lished literature articles have demonstrated the possibility of extracting concepts of bi-
ological meaning of various types. Semantic knowledge graphs are constructed to con-
nect concepts of various ty
pes based utilizing a number of resources such as literature knowledge and biological
databases. Such knowledge graphs can then be used to infer novel connections based
on network mining methods [6, 7]. In addition to semantic connections, large efforts
were made to integrate biological databases across gene, protein, pathway, disease and
drug domains. The Euretos Knowledge Platform (EKP, http://www.euretos.com/) is a
commercial database that integrates more than 170 different biological resources in-
cluding semantic data [http://www.euretos.com/files/EKPSources2017.pdf]. These
data sources are used by EKP to build a large network of connected biological concepts.
Disease and drug concepts in EKP are directly or indirectly connected based on prior
knowledge found in publications and/or other databases. We expect that leveraging a
large set of databases will enhance our drug discovery ability and avoid relying on a
single source of information to associate drugs to diseases. Each semantic type provides
us with an additional layer of information that can be exploited to identify novel drug
disease associations.

In this work, we have taken advantage of the EKP to evaluate the usefulness of the
underlying semantic knowledge graphs to predict novel drug-disease associations. With
the current exponential growth in biological data, semantic knowledge graphs have a
great potential for drug discovery.
3

2 Materials and Methods

2.1 Data Acquisition and Mapping in EKP

Drug disease pairs were acquired from Guney et al [8]. We specifically acquired the
drug disease associations based on their analysis. We had 403 pairs of 239 drugs and
78 diseases that formed our positive “gold-standard” (GD) data. By randomly shuffling
the 403 drug disease pairs of the positive dataset, we created 20 unique negative da-
tasets that included 403 random drug disease pairs not seen in the positive dataset. We
averaged the results of the negative datasets in the downstream analysis.

In EKP we first mapped the DrugBank IDs of the drugs in our datasets to drug con-
cepts in EKP. We used full disease names to map the diseases in our dataset to disease
concepts in EKP. Triples of drug disease pairs were identified in EKP if they were
directly connected by at least one of the resources used in EKP (Figure-1). Predicates
of drug disease triples were classified as “relevant” if they belonged to one of the fol-
lowing categories: “treats”, “affects”, “prevents”, “disrupts”. The LUMC has a local
installation of this knowledge graph for research purposes.

2.2 Network Features

Network features were calculated for the intermediate concepts connecting drug disease
pair. To evaluate if we could use the indirect associations to predict novel associations
between drugs and diseases, we used the positive and negative datasets as follows. For
each indirect association, we calculated a number of features and tested if these features
could separate the two datasets. These features were calculated for each semantic sub-
category (SubSemantic) available in EKP.

I. Count_normalized referred to as count in the following text:

(“SubSemantic_typeY”) = X ÷ (y × z) (1)

X = total number of SubSemantic_typeY connecting the drug (y number of unique drugs
making one drug concept) with disease (z number of unique diseases making one dis-
ease concept). The number of intermediate concepts between the drug and disease con-
cepts was normalized by the multiplication of y and z.

II. Diversity = The total number of unique SubSemantic categories connecting the
drug and disease concepts per semantic type.
4

III. Predicates from the drug concept to the intermediate concept and from the in-
termediate concept to the disease concept were combined and referred to as “predicate
path”. We used the Chi-square test to identify, within each semantic subcategory, the
most enriched paths in the GD vs the negative dataset (cutoff p-value < 0.05). We fil-
tered out paths that made up less than 1% of the total amount of paths within each
semantic subcategory.

For I and II we used the Kolmogorov–Smirnov to test the similarity of the distribu-
tion of scores between the positive and the negative datasets (cutoff p-value < 0.05)

3 Results and Discussion

3.1 Concept Mapping and Direct Associations

We acquired the dataset of curated drug disease relationships (drugs used in the treat-
ment of certain diseases) from Guney et al [8]. The GD dataset included 239 drugs, 78
diseases and 403 drug-disease pairs. For the negative dataset, we reshuffled the GD into
20 random datasets. The results of the negative datasets were averaged and compared
to the GD.

We used DrugBank IDs available in the GD dataset to map drugs from the GD and
negative datasets into EKP concepts and we used the full disease name to map diseases,
since no unique identifier was supplied in the GD dataset. Out of 239 drugs, 235 were
mapped successfully. All diseases were mapped successfully into EKP. Whendisease
or drug term mapped to more than one concept in the EKP, this was corrected for (Fig-
ure-1).

Using the EKP we retrieved the triples for drug-disease pairs found in the GD and
negative datasets. Each semantic triple consists of a subject-predicate-object, where the
subject and the object refer to the drug and the disease respectively, and the predicate
refers to the relationship connecting them. From the pairs found in the GD, 83%
mapped to a triple in the EKP, whereas in the negative datasets 22% of the pairs mapped
to a triple in the EKP. Moreover, from the mapped triples in the GD, 90% had a pred-
icate type that we consider positive for a drug-disease association i.e. ‘treats’, compared
to 75% in the negative datasets. These results demonstrate that the drug disease pairs
in the GD and the negative datasets are different in two main aspects. 1). Most of the
GD drug disease pairs could be represented in direct triples owing to prior knowledge
of the pair’s relationship. 2). The type of the predicates is different when comparing the
triples of the GD and negative datasets, where the GD contains a higher proportion of
the “relevant” predicates. The observed 22% of random drug disease pairs that mapped
to triples in EKP could be explained by the smaller proportion of “relevant” predicates
5

compared to GD. These triples would contain negative drug disease indications or a
drug that treats a side symptom of the disease.

3.2 Evaluating the Indirect Drug-Disease Associations

As we are interested in drug repurposing, we were looking for novel associations be-
tween drugs and diseases. We utilized the indirect drug disease associations as a basis
for our method, where we aim to mine the full EKP graph of indirect drug disease
associations for strong candidates using network based features. To identify which fea-
tures are useful, we used the GD and the negative datasets and evaluated several net-
work features on the indirect associations retrieved from them. In the EKP, 14 semantic
types are defined based on the semantic groups as defined by the Unified Medical Lan-
guage System [9], with a number of semantic subcategories under each semantic type.
Our analysis of indirect associations, i.e. drugs and diseases that connected via a third
concept, was done per subsemantic category.

All 403 drug-disease associations in the GD and negative dataset were connected by
at least one intermediate concept from the semantic types available in EKP. Out of the
14 possible semantic categories, 12 were found to connect a drug and a disease. We
next evaluated which semantic and semantic subcategories were the most informative.
Using the count diversity feature, defined as the total number of a certain intermediate
concept connecting a drug disease pair, the semantic type ‘Chemicals & Drugs’ was the
most informative intermediate semantic type and distinguished the positive and nega-
tive sets best (Kolmogorov-Smirnov p-value: 7.4. 10-23). Density plots of the count val-
ues per semantic and semantic subcategory in both the GD and the negative data reveal
visually that the GD contained a higher number of indirect concepts in most semantic
categories compared to the negative dataset, such as “Chemicals & Drugs”, “Anatomy”,
“Disorders” and “Procedures” semantic categories (Figure-2A, Table-1).

Another feature we investigated was the diversity of the different semantic types
connecting a drug disease pair. In this analysis we compared the total number of unique
semantic categories and semantic subcategories in the drug disease pairs of the GD and
negative datasets. As observed for the count feature, the GD drug disease pairs dis-
played a higher semantic diversity in their intermediate concepts (Figure-2B).

We also investigated the predicate types that connect the indirect concept with the
drug disease pairs. In this analysis we used two predicates, the one connecting the drug
with the intermediate concept and the one connecting the intermediate concept with the
disease concept. The combination of these two predicates in this order is referred to as
the predicate path. Using the chi-squares test we investigated if there were predicate
paths that are enriched in the GD and negative datasets. We found the most enriched
paths in the “Amino Acid, Peptide or Protein” and “Pharmacologic Substance” seman-
6

tic subcategories (Figure-2C). For example, the path “drug is compared with  Phar-
macologic Substance  treats Disease” that belongs to the “Pharmacologic Sub-
stance” semantic subcategory is strongly enriched in the GD that can be interpreted as
drugs that are known to be similar in function or chemical properties can be repurposed
for the same disease.

These results indicate that the type of, count and the predicates relating to the inter-
mediate concepts connecting a drug and a disease pair were informative in differentiat-
ing positive and negative datasets. The added values of using a diverse set of semantic
categories was demonstrated. In the count feature, we found almost all semantic cate-
gories shifted towards higher values in the GD when compared to the negative data.
Additionally, the diversity feature revealed that the GD tends to have a higher number
of semantic categories and subcategories as intermediate concepts connecting drugs
and diseases. Having the ‘Chemicals & Drugs’ as the most differentiating semantic
category also demonstrates the importance of looking at drug properties and not com-
pletely relying on the drug targets.

In contrast to other tools, our methodology is different in a number of ways. The
quantity and diversity of databases that we included is larger and the content much
richer than other comparable tools. In terms of quantity we have taken advantage of
EKP that integrates more than 170 resources. Other network-based tools such as SLAP
[6] and ProphNet [10] include 17 and 3 databases respectively. In terms of diversity,
EKP includes databases that span drug, disease, phenotype, protein, gene and molecular
pathways. Additionally, EKP takes advantage of mining the PubMed published litera-
ture. To our knowledge this is the most resource inclusive effort in network-based drug
disease associations. Our methodology utilizes drug disease connections beyond the
commonly used drug-targets-disease framework to expand the possibilities to include
other semantic categories, such as drug-drug and disease-disease similarities, pheno-
types, pathways, proteins and biological function annotations. Our method utilizes se-
mantic knowledge graphs properties and can be extended to other semantic knowledge
graphs that contain drug and disease concepts.

4 Conclusions

Computational efforts in drug discovery are gaining popularity for their ability to re-
duce the costs involved in drug development. Network-based approaches are currently
being used for drug repurposing efforts. We have taken advantage of the EKP that in-
tegrates more than 170 biological sources. Leveraging 12 semantic categories that are
found in the EKP to connect drug and disease pairs, we identified three main network
features that showed significant differences in the characteristics of the intermediate
concepts connecting the drug disease pairs in the Gold Standard and negative datasets.
These features can be readily used to build a classifier that will mine the full EKP graph
7

to propose novel drug disease associations. Additional network features that are tailored
to specific semantic types can be further extracted to fine tune the performance of the
classifier.

This work demonstrates that semantic knowledge graphs have a strong potential in
mitigating drug discovery efforts. We expect semantic graphs to grow with the expo-
nential growth in data generation in life sciences. Thus, rendering semantic knowledge
graphs even more valuable for drug discovery.

Table 1. Top 5 most significant semantic subcategories based on count feature.

Semantic type(subcategory) Semantic Subcategory Kolmogorov-Smirnov p-value
Organic Chemical Chemicals & Drugs 7.39E-23
Pharmacologic Substance Chemicals & Drugs 1.09E-22
Indicator, Reagent, or Diag-
Chemicals & Drugs 7.12E-19
nostic Ai
Hazardous or Poisonous Sub-
Chemicals & Drugs 1.2E-16
stance
Chemical Viewed Structurally Chemicals & Drugs 2.72E-16

Fig. 1. We have used the Euretos Knowledge Platform (EKP) as the semantic knowledge graph
in this analysis. Biological concepts (e.g. drugs, diseases, genes) are represented as circles, with
different colors suggesting the variety of semantic types in the EKP (A). The drug disease pairs
we acquired from an independent source were mapped to EKP concepts (B). Notably, mapped
pairs were connected by intermediate concepts of 12 out of 14 different semantic types. We ex-
tracted network features from the intermediate concepts (C) to use them in building a classifier
(future work) (D) to predict novel drug disease associations in the semantic network (E). Black
dashed line reflects ongoing parts D and E of which their results are not included in this manu-
script.
8

Fig. 2. A). Density plots of the count feature of three semantic subcategories. The
higher the count value on the x-axis the higher the higher this semantic subcategory is
found as an intermediate concept between drug and disease pairs.
B). Boxplots representing the diversity feature for concepts in each of the 12 semantic
categories. For each semantic category, we have calculated the presence of each of the
subcategories belonging to that semantic category.
C). Word Cloud representation of predicate paths of the “Pharmacological Substance”
semantic category. P-values of the chi-square test residuals were used as an input to the
cloud to calculate the enrichment of each path in either the positive and the negative
datasets.

5 Acknowledgments
The research leading to these results has received funding from the People Program (Marie Cu-
rie Actions) of the European Union’s Seventh Framework Program FP7/2077-2013 under REA
grant agreement no. 317246. In addition, the European Commission (FP-7 project RD-Connect,
grant agreement No. 305444).

6 Competing Interests

Kristina M. Hettne has performed paid consultancy since November 1, 2015, for Eure-
tos b.v, a startup founded in 2012 that develops knowledge management and discovery
9

services for the life sciences, with the Euretos Knowledge Platform as a marketed prod-
uct

7 References

1. Choi, S., Macalino, S.J.Y., Cui, M., Basith, S.: Expediting the Design, Discovery, and Devel-
opment of Anticancer Drugs using Computational Approaches. Curr. Med. Chem. (2016).
2. Glick, M., Jacoby, E.: The role of computational methods in the identification of bioactive
compounds. Curr. Opin. Chem. Biol. 15, 540–546 (2011).
3. Perlman, L., Gottlieb, A., Atias, N., Ruppin, E., Sharan, R.: Combining drug and gene simi-
larity measures for drug-target elucidation. J. Comput. Biol. 18, 133–145 (2011).
4. Yildirim, M.A., Goh, K.-I., Cusick, M.E., Barabási, A.-L., Vidal, M.: Drug-target network.
Nat. Biotechnol. 25, 1119–1126 (2007).
5. Wu, Z., Wang, Y., Chen, L.: Network-based drug repositioning. Mol. Biosyst. 9, 1268–1281
(2013).
6. Chen, B., Ding, Y., Wild, D.J.: Assessing drug target association using semantic linked data.
PLoS Comput. Biol. 8, e1002574 (2012).
7. Hettne, K.M., Thompson, M., van Haagen, H.H.H.B.M., van der Horst, E., Kaliyaperumal,
R., Mina, E., Tatum, Z., Laros, J.F.J., van Mulligen, E.M., Schuemie, M., Aten, E., Li, T.S.,
Bruskiewich, R., Good, B.M., Su, A.I., Kors, J.A., den Dunnen, J., van Ommen, G.-J.B., Roos,
M., ’t Hoen, P.A.C., Mons, B., Schultes, E.A.: The Implicitome: A Resource for Rationalizing
Gene-Disease Associations. PLoS One. 11, e0149621 (2016).
8. Guney, E., Menche, J., Vidal, M., Barábasi, A.-L.: Network-based in silico drug efficacy
screening. Nat. Commun. 7, 10331 (2016).
9. McCray, A.T., Burgun, A., Bodenreider, O.: Aggregating UMLS semantic types for reducing
conceptual complexity. Stud. Health Technol. Inform. 84, 216–220 (2001).
10. Martínez, V., Cano, C., Blanco, A.: ProphNet: a generic prioritization method
through propagation of information. BMC Bioinformatics. 15 Suppl 1, S5 (2014).