Designing of Information Model for Prediction of Drug-drug
Interactions based on Calculation of Target and Therapeutic
Similarity
Olha Marushchaka, Rostyslav Kosarevycha
a
    Lviv Polytechnic National University, Stepan Bandera street 12, Lviv, 79016, Ukraine


                Abstract
                Understanding and predicting the drug–drug interactions is an important in both drug
                development and clinical application, especially for co-administered medications. We
                propose a new information model for drug–drug interaction analysis, based on the common
                biological targets and therapeutic similarity. Based on the data from DrugBank, our model
                calculates target similarity and therapeutic similarity features. To predict the possible drug-
                drug interactions it uses a semi-supervised approach, defined in two steps: adding the
                missing labels using the clustering algorithm K-Means, and then, executing a classification
                with a supervised learning model Support Vector Machine. Proposed model is tested for
                the known data set and had shown the high classification rate, with the AUC=98.5+-0.05.

                Keywords 1
                Machine learning, predictive model, drug-drug interactions, drug-related data, data analysis,
                target similarity, therapeutic similarity

1. Introduction
    Identification and prediction of drug-drug interactions (DDI) is a widespread topic of the research
in the healthcare, and studying such aspect is a big part of the drug development process [1]. Drug-drug
interactions occur when two or more drugs react with each other and are vital for the patient safety and
success of treatment modalities, they can lead either to the loss of efficacy an adverse drug reaction, or
cause the increasing of the therapeutic effect [1, 2]. DDIs can be categorized into three types:
pharmaceutical, pharmacokinetic and pharmacodynamic [3].
    A number of computational methods have been employed for the prediction of DDIs based on drugs
structures and/or functions: physiologically based pharmacokinetic modeling, molecular structural
similarity analysis, ontology and annotation-based analysis, network modeling, QSAR modeling [4].
    We can divide the machine learning-based methods used for the prediction of DDIs according to the
approach used: unsupervised, supervised, and semi-supervised machine learning-based algorithms [21].
    In one study [4] it was proposed to use an unsupervised machine learning model for predicting DDIs
using the structural similarities of drugs from the Pharmacokinetic and Pharmacodynamic networks and
investigated the factors influencing DDIs for further improvement of the predictions. In other study [5],
the drug-target pairs were predicted, resulting in a network with strong local clustering of similar types
according to Anatomical Therapeutic Chemical (ATC) classification. In other studies, it was used the
genomic data and the drug structural characteristics, or the physical and chemical features of drug
molecules to create different hypothesis on the possible DDIs and proceed the unsupervised machine
learning approaches [8, 10, 11].


IDDM’2020: 3rd International Conference on Informatics & Data-Driven Medicine, November 19–21, 2020, Växjö, Sweden
EMAIL: olha.marushchak.w@gmail.com (O. Marushchak); kosar2311@gmail.com (R. Kosarevych)
ORCID: 0000-0001-5620-1299 (O. Marushchak); 0000-0001-9108-0365 (R. Kosarevych)
             ©️ 2020 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
    The aforementioned studies proceeded a vast amount of data [4,8,19]. The research objectives
mostly included investigating the underling mechanisms of possible drug-target and drug-drug
interactions [11, 19, 20]. However, it was noticed that the known DDIs were not taken into consideration
for the unsupervised learning models [4, 5, 11]. We assume that the known DDIs are a valuable piece
of information, as their characteristics can serve as a benchmark for not yet discovered combinations.
    Several studies focused on predicting the DDI though protein-protein-interaction networks with
chemical features [3, 6, 7], implementing supervised learning models to the labeled data. The similarity-
based approach has been used to predict the possible outcomes of combining drug pairs [10, 14, 15].
    Only few of the studies proceeded the ‘in vitro’ experiments to evaluate their models [9, 11, 12],
other evaluated the performance of their methods by comparison of the predicted DDI and the reported
ones in the literature [4, 14, 16].
    We have observed that the supervised learning models analyzed significantly smaller amount of
possible DDIs [15, 16]. It was caused by the relatively small number of the known DDIs in the literature,
and therefore, picking the corresponding amount of the drug pairs without indication of possible
interaction in the literature. Besides, we assume that there should be more complex procedure of data
labeling, because it improves the performance of the next predictions [12].
    So, predicting DDIs is a complex problem [1, 3] that requires addressing it from the medical
perspective – in a form of creating a hypothesis and picking suitable drug-related characteristics; and
from the computer science perspective – by choosing the appropriate computational methods and
predictive models.
    In this study, we propose a new information model for drug–drug interaction analysis, based on the
common biological targets and therapeutic similarity. The information model is able to proceed the data
extraction from the source, execute the calculation of the features, execute the data labeling process and
make predictions of the possible DDIs.
    Regarding that many researchers obtained data for their investigations of DDI from a database
DrugBank [17], we used DrugBank as the data source for our work as well. Also we used the calculation
methods of target and therapeutic similarities features proposed by Cheng et al. [l5, 16] – such approach,
combined with the supervised learning algorithm, Support Vector Machine has shown a significant
accuracy in predicting DDI on the sample.
    We examined the hypothesis of predicting the based on the common biological targets and
therapeutic use, instead of including chemical and physical descriptors of the drugs.
    We addressed the following problem: the researchers added labels meaning the absence of DDI
when the drug pair didn’t have the DDI indication in the data source, however there might be the
unreported, and used only 3% of the original input data. In this study we want to improve the data
labeling process, and that would enable us to use the whole dataset as well for the next predictions. So,
we followed a semi-supervised approach, which consists of first, clustering algorithm for obtaining the
missing data labels, and then, classification to predict the possible drug-drug interactions.

2. Methods and Materials

2.1.    Obtaining Input Data
    We used a database DrugBank as a source of data. Drug bank is freely accessible, online database
containing information on              drugs      and drug targets [17]. It contains both
bioinformatics, cheminformatics details about drugs such as resource, chemical, pharmacological and
pharmaceutical data with comprehensive drug target (sequence, structure, and pathway) information.
    Obtaining data for the study was performed by parsing the xml document. We extracted the
following drug information: drug name, DrugBank ID of the drug, targets, ATC code (anatomical-
therapeutic-chemical classification), known drug-drug interactions. From the obtained set of drugs and
characteristics, we removed drugs that did not contain ATC code (experimental drugs, homeopathic
and herbal traditional medicinal products) as well as drugs of antibodies and inorganic salts.

2.2.     Designing of the Information Model
   The information model was designed to proceed all the steps required for the drug-related data
analysis and prediction of the possible drug-drug interactions.


Figure 1: Schema of the process held within the information model

   All possible unique drug combinations were composed, and the calculation of the target similarity
and therapeutic similarity features was performed.
   For the target similarity (SB) feature we used an approach proposed by Cheng et al. [16]. We
summarized all the unique biological targets identified for the drugs, added them to the general
sequence, and created binary vectors for each drug. If the drug affects a biological target - then a certain
element of the vector contains a value of 1, if the drug has no effect on the target - the value is 0. After
that, for each combination of drugs, we constructed the target similarity by calculating the Tanimoto
coefficient for binary vectors of drugs:
                                                      𝑁𝑎𝑏                                              (1)
                                   𝑆𝐵(𝑎, 𝑏) =                  ,
                                               𝑁𝑎 + 𝑁𝑏 − 𝑁𝑎𝑏
where 𝑁𝑎𝑏 represents the quantity of the biological targets, which are common for both drugs of the
combination;
𝑁𝑎 represents quantity of the biological targets, which the drug a affects;
𝑁𝑏 represents quantity of the biological targets, which the drug b affects.
   For the therapeutic similarity feature (ST) we used the method proposed by Cheng et al. [15]. We
created five sets with unique ATC codes representing each of the five ATC classification levels for each
drug pair. Next, for each drug pair for each ATC classification level the therapeutic similarity feature
was calculated:
                                             𝐴𝑇𝐶𝑘 (𝑎) ∩ 𝐴𝑇𝐶𝑘 (𝑏)                                       (2)
                               𝑆𝑇𝑘 (𝑎, 𝑏) =                        ,
                                             𝐴𝑇𝐶𝑘 (𝑎) ∪ 𝐴𝑇𝐶𝑘 (𝑏)

where k represents an ATC classification level (from 1 to 5);
𝐴𝑇𝐶𝑘 (𝑎) represents ATC codes of the k-level for the drug a;
𝐴𝑇𝐶𝑘 (𝑏) represents ATC codes of the k-level for the drug b.
   After that, the general therapeutic similarity was calculated considering all five ATC classification
levels:
                                                ∑𝑛𝑘=1 𝑆𝑇𝑘 (𝑎, 𝑏)                                   (3)
                                   𝑆𝑇(𝑎, 𝑏) =                    ,
                                                       𝑛
where n represents the overall number of ATC classification levels (=5);
𝑆𝑇𝑘 (𝑎, 𝑏)represents the previously calculated therapeutic similarity for each ATC classification level.
    The drug pairs that were indicated in the DrugBank as known, received the label 1. For the remaining
drug combinations there is not enough information in DrugBank to assert or deny the drug-drug
interaction, so no labels were added.
    In order to predict the labels for the drug pairs that contained no label, the clustering method K-
Means was applied. We have chosen K-Means method because it has been used by the researchers in
the healthcare fields as the first step of the semi-supervised machine learning approach to define the
missing data labels [10, 18].
    In K-Means method the centroids are randomly initialized from the dataset. Then, from each centroid
the Euclidean distance is calculated to each data point, and depending upon the minimum distance
between the centroids and data points, that data point is assigned to that centroid. This is repeated until
there is no change of the centroids. In this way, the clusters are formed.
    The accuracy of the clustering was calculated according to the percentage of how much of the
clustered labels 1 match the original labels 1 for the drug pairs.
    For the predicting of the possible drug-drug interactions, we used the supervised machine learning
model Support Vector Machine (SVM), namely Linear Support Vector Classification. We made our
choice based on the literature review: in drug-related research this method is used to solve classification
problems. In the studies we investigated, such method has shown significantly good performance [14,
16].
    The AUC value was calculated, and the confusion matrix was composed to evaluate the performance
of the model.
    The research was performed using the programming language Python3. The xml parsing was
proceeded using the library by using library xml.etree.ElementTree in Python3. The data analysis was
performed using libraries numpy, pandas, sklearn. Data visualization was executed using libraries
matplotlib, seaborn.
    We used open-source software which is freely available and contributed by the global community
of developers.

3. Results
   From DrugBank we obtained information about 721 drugs, which has been used as an input for the
information model.
   For the feature construction, 266085 unique drug-drug combinations were created. The target and
therapeutic similarities were calculated and assigned to the corresponding drug pairs. 6946 drug pairs
were actually indicated in DrugBank as having the drug-drug interaction, so they received the label 1.
   Whole dataset was used as an input for the clustering algorithm, with k=2 clusters. The accuracy
calculated with our method is 54%.
   After that, for all known drug combinations that contained the label 1 before clustering, we left the
original labels, and for the drug combinations with the missing ones, we assigned the labels obtained
as a result of clustering.
   We investigated the distribution of each feature according to their labels.


Figure 2: Distribution of the feature Target Similarity
Figure 3: Distribution of the feature Therapeutic Similarity

   The distribution is binomial, the values are contained in range [0, 1]. It is noted that the drug
combinations with label 1 have two density peaks in the area 0.3 and 0.65; and the drug combinations
with label 0 have two density peaks in the area 0 and 0.175.
   To execute the classification, the whole dataset was splitted into the train set (70%) and test set
(30%). Such splitting ratio (70/30) has been widely used by the scientific community for data analysis.
In our research we didn’t notice the significant difference of performance with various splitting, but
with the 70/30 ration the AUC value was the highest – 98.53 (Comparing to AUC=98.39 for 90/10,
AUC=98.41 for 80/20, AUC=98.47 for 60/40).
   We applied the Linear Support Vector Classification method of the Support Vector Machine
algorithm with the linear kernel.
   Based on the prediction, we received the Area Under the Curve (AUC) value = 98.5+-0.05 illustrate
the absolute values of prediction of the training set, we composed the following confusion matrix:


Figure 4: Confusion matrix to evaluate the predictions of information model

   The count of predicted drug-drug interactions is 35 186. The number of correctly predicted drug
pairs that do not have drug-drug interactions is 43 469. There are 217 drug pairs with label 1 that were
predicted as such not having the drug-drug interaction. 954 drug pairs were wrongly predicted as having
the drug-drug interaction, although they were labeled as 0.

4. Conclusion
    The information system for the drug-related data analysis and the prediction of the possible drug-
drug combinations based on their calculated target and therapeutic similarities was created. It uses a
semi-supervised learning approach, in order to firstly, define the missing labels using the clustering
algorithm, and then, execute a classification using a supervised learning model.
    Our examined hypothesis to use data about biological targets and therapeutic use has received
reinforcement in the form of high predictive performance on the dataset from DrugBank, verified with the
test set.
   By executing the data labeling process, we were able to use for the further predictions all amount of
drug combinations, including the 97.39% that didn’t have the labels at the beginning.
   In the similar studies that use same dataset, included biological targets or therapeutic use into their
examined hypothesis, the results were AUC=0.968 [14] and AUC=0.912 [15]. So, our
implementation of the proposed information system has shown accuracy of classification about 98.5+-
0.05 (AUC) for the DrugBank dataset and it outperforms other similar systems.
      This information system can be enhanced with the functionality to calculate more features such
as enzyme similarity and transporter similarity.

5. Acknowledgements
   We thank Olga Boretska (Danylo Halytskyi Lviv National Medical University) for providing insight
and expertise that greatly assisted the research.

6. References
[1] Waters, Nigel J. "Evaluation of drug–drug interactions for oncology therapies: in vitro–in vivo
     extrapolation model‐based risk assessment." British journal of clinical pharmacology 79.6 (2015):
     946-958.
[2] U.S. Food and Drug Administration: Drug Interaction: what you should know. URL:
     https://www.fda.gov/drugs/resources-you-drugs/drug-interactions-what-you-should-know
[3] Huang, Jialiang, et al. "Systematic prediction of pharmacodynamic drug-drug interactions through
     protein-protein-interaction network." PLoS Comput Biol 9.3 (2013): e1002998.J. Cohen (Ed.),
     Special issue: Digital Libraries, volume 39, 1996.
[4] Takeda, Takako, et al. "Predicting drug–drug interactions through drug structural similarities and
     interaction networks incorporating pharmacokinetics and pharmacodynamics knowledge." Journal
     of cheminformatics 9.1 (2017): 1-9.
[5] Yıldırım, M., Goh, K., Cusick, M. et al. Drug—target network. Nat Biotechnol 25, 1119–1126
     (2007). https://doi.org/10.1038/nbt1338.
[6] Lei Huang, Fuhai Li, Jianting Sheng, Xiaofeng Xia, Jinwen Ma, Ming Zhan, Stephen T.C. Wong;
     DrugComboRanker: drug combination discovery based on target network analysis,
     Bioinformatics, Volume 30, Issue 12 (2015): i228–i236
[7] Zhao, Xing-Ming, et al. «Prediction of drug combinations by integrating molecular and
     pharmacological data» PLoS computational biology 7.12 (2011): e1002323.
[8] Chandrasekaran, Sriram, et al. "Chemogenomics and orthology‐based design of antibiotic
     combination therapies." Molecular systems biology 12.5 (2016): 872.
[9] Li, Xiangyi, et al. "Prediction of synergistic anti-cancer drug combinations based on drug target
     network and drug induced gene expression profiles." Artificial intelligence in medicine 83 (2017):
     35-43.
[10] Ferdousi, Reza, Reza Safdari, and Yadollah Omidi. "Computational prediction of drug-drug
     interactions based on drugs functional similarities." Journal of biomedical informatics 70 (2017):
     54-64
[11] Aghakhani, Sara, Ala Qabaja, and Reda Alhajj. "Integration of k-means clustering algorithm with
     network analysis for drug-target interactions network prediction." International Journal of Data
     Mining and Bioinformatics 20.3 (2018): 185-212.
[12] Chen, Xing, et al. "NLLSS: predicting synergistic drug combinations based on semi-supervised
     learning." PLoS computational biology 12.7 (2016): e1004975.
[13] Peng Li, Chao Huang, Yingxue Fu, Jinan Wang, Ziyin Wu, Jinlong Ru, Chunli Zheng, Zihu Guo,
     Xuetong Chen, Wei Zhou, Wenjuan Zhang, Yan Li, Jianxin Chen, Aiping Lu, Yonghua Wang;
     Large-scale exploration and analysis of drug combinations, Bioinformatics, Volume 31, Issue 12,
     15 June 2015, Pages 2007–2016
[14] Song, Dalong, et al. "Similarity‐based machine learning support vector machine predictor of drug‐
     drug interactions with improved accuracies." Journal of clinical pharmacy and therapeutics 44.2
     (2019): 268-275.
[15] Cheng, F., Li, W. et al. Prediction of polypharmacological profiles of drugs by the integration of
     chemical, side effect, and therapeutic space. Journal of chemical information and modeling, 53(4),
     (2013): 753-762.
[16] Cheng, F., & Zhao, Z. Machine learning-based prediction of drug–drug interactions by integrating
     drug phenotypic, therapeutic, chemical, and genomic properties. Journal of the American Medical
     Informatics Association, 21(e2), (2014): e278-e286.
[17] Wishart, David S., et al. "DrugBank: a knowledgebase for drugs, drug actions and drug
     targets." Nucleic acids research 36.suppl_1 (2008): D901-D906.
[18] Singh, Reetu, and E. Rajesh. "Prediction of Heart Disease by Clustering and Classification
     Techniques." International Journal of Computer Sciences and Engineering 7 (2019): 861-866.
[19] Li, Xiangyi, et al. «Biomolecular network-based synergistic drug combination discovery» BioMed
     research international2016 (2016).
[20] Wu, Zikai, Xing-Ming Zhao, and Luonan Chen. «A systems biology approach to identify effective
     cocktail drugs» BMC systems biology. Vol. 4. No. 2. BioMed Central, 2010.
[21] Li, Xiangyi, et al. "Biomolecular network-based synergistic drug combination discovery." BioMed
     research international 2016 (2016).