=Paper=
{{Paper
|id=Vol-2042/paper44
|storemode=property
|title=Clustering Cancer Drugs According to their Mechanisms of Action
|pdfUrl=https://ceur-ws.org/Vol-2042/paper44.pdf
|volume=Vol-2042
|authors=Syed Abdullah Ali,Riza Theresa Batista-Navarro
|dblpUrl=https://dblp.org/rec/conf/swat4ls/AliB17
}}
==Clustering Cancer Drugs According to their Mechanisms of Action==
Clustering Cancer Drugs According to their Mechanisms of Action Syed Abdullah Ali and Riza Batista-Navarro University of Manchester, Oxford Road, Manchester, M13 9PL, UK syeabdullah.ali@postgrad.manchester.ac.uk, riza.batista@manchester.ac.uk Abstract. This research investigates similarities between cancer-related drugs with respect to their mechanisms of action (MOA) guided by infor- mation extracted from two resources: scientific literature and ontologies. To find similarity between drug pairs, the Chemical Entities of Biological Interest (ChEBI) ontology and Gene Ontology (GO) were leveraged to compute drug and drug target features, respectively. A graph of drugs was formed based on drug pairs clustered using two unsupervised graph clustering algorithms: Chinese Whispers and Louvain Method. As a re- sult of clustering, drugs that share the same dominant MOA were placed in the same cluster. Additionally, the most prominent drugs in the en- tire graph and within each cluster were identified according to graph centrality measures. Quality of the clusters was assessed by calculating silhouette coefficient values, ensuring consistency of results generated by the two algorithms, and employing the help of a domain expert who carried out manual evaluation. Keywords: Mechanisms of Action, Drug Clustering, Chinese Whispers, Louvain Method, Cancer 1 Introduction Cancer, the abnormal growth of cells due to gene mutations, is one of the top causes of mortality. In 2014 alone, cancer claimed 163,000 lives in the UK, which means that on the average, 450 die every day [24]. One in every two people who were born after 1960 in the UK will be diagnosed with cancer at some point in his or her life [2]. With over 200 different types, cancer is usually progressive, chronic and has a mild phenotype making it hard to diagnose. The discovery of anti-cancer drugs has not been a successful endeavour. The rate at which new molecular entities (NMEs) are approved for clinical investi- gation is low. In the US, between the period of 2010 and 2014, the maximum number of approved NMEs was only 11 (in 2012) [14]. In addition to that, many drugs fail in clinical trials. The recorded success rate in cancer trials is three times lower than in clinical trials for cardiovascular diseases [13]. Numerous efforts have been carried out, based on different approaches, to facilitate anti-cancer drug discovery such as: proteomic analysis to highlight pro- tein drug targets [3], metabolic control analysis to highlight likely pathological metabolism targets [6] or development of drug ontology databases [7, 8] for in- depth understanding of mechanisms of action (MOA) of known drugs. A MOA is a biochemical interaction that results in a drug producing its desired therapeu- tic pharmacological effect [1]. It depicts events happening at the chemical level, including drugs binding with drug targets, as well as reactions with enzymes or receptor sites [15]. To understand these mechanisms, different approaches have been proposed, e.g., analysis of chemical properties of drugs as well as of drug targets, and the extraction of drug-protein relationships from scientific literature. However, no effort has attempted to combine these approaches to group drugs according to their MOA. Clustering drugs which share similar MOA streamlines the drug discovery process as it constrains the number of drugs that need to be investigated with respect to their drug targets. It also provides the foundation for understanding and advancing combinatorial drug therapies for cancer. To fill this research gap, we propose an approach that makes use of infor- mation from ontologies for clustering drugs in order to highlight similarities and differences between them according to a range of physiochemical properties. The approach is novel in that it seeks to identify groups or clusters of drugs which are similar in terms of three features: (1) chemical structure, (2) biological role and (3) drug target properties. In forming groups amongst a total of 831 drugs, we investigated the use of unsupervised graph clustering techniques. 2 Related Work Clustering drugs based on similarity is a well-known approach to drug discovery, although there is great variation in terms of how similarity measures have been defined. Gemma et al. 2006 [9], for instance, clustered lung cancer drugs based on their gene expression profiles and sensitivity to multiple lung cancer cell lines. The results suggested that one of the drugs acted particularly different than the rest of the drugs; hence this drug might be useful in second-line chemotherapy if it was not administered to a patient initially. A similar attempt was made in Uhr et al. 2015 [23], whose research focussed on clustering 37 breast cancer drugs based on their sensitivity to 42 breast cancer cell lines. This resulted in six clusters which highlighted relationships between drugs and their sensitivities. Meanwhile, Jeon et al. 2011 [11] employed k -means clustering on gene expres- sion data to understand changes in mitochondrial proteins brought about by mitochondrial DNA depletion. Their research revealed how cells compensate for mitochondrial DNA depletion and it also led to the identification of proteins that repair this depletion. Ross et al. 2000 [20] presented a clustering approach which analysed 60 cell lines (NCI-60) based on microarray analysis, i.e., the parallel monitoring of gene expression levels in thousands of genes within the context of a specific biological process across different environments or tissue samples (e.g., tumours). Hierar- chical clustering was performed to cluster cell lines and genes, separately. Their results showed that two of the breast cancer cell lines are similar to melanoma cell lines, suggesting a relationship between the two types of cancers. A data-driven approach was presented in Udrescu et al. 2016 [22] where in- formation on drug-drug interactions (DDI) from the DrugBank database was utilised to find communities based on Louvain Method [5]. Prominent drugs were ranked using graph centrality measures [16] such as degree, betweenness, closeness, eigenvector and PageRank centrality [17]. A total of 1,141 nodes (cor- responding to drugs) and 11,688 links (corresponding to DDIs) were divided into nine different clusters, each of which represented a specific class of drugs. For example, one cluster represented drugs that targeted the immune system; another pertained to drugs for the nervous system, and so on. Effectively, the research identified functional drug categories and relationships. While each of the related work presented above holds resemblance to our own research methodology in attempting to cluster drugs based on their chem- ical interactions, our work proposes a different set of features for highlighting similarities between drugs. Specifically, the work last mentioned above assumes all drug similarities to be of equal importance whereas we calculate similarity scores based on various features, described in the next section. 3 Methodology Our approach to clustering drugs according to their mechanisms of action is based on their similarity with respect to the following types of information de- rived from ontologies: (1) chemical structure, (2) biological roles, and (3) drug target properties. Guided by a set of biomolecular events automatically extracted from scientific literature, we formed a list of drugs and drug targets of interest, whose features were calculated with the help of two ontologies: the Chemical En- tities of Biological Interest (ChEBI) ontology [8] and the Gene Ontology (GO) [7]. The rest of this section describes in detail our processing pipeline, also de- picted in Figure 1. 3.1 Information extraction A set of biomolecular events, i.e., interactions between drugs and their targets, served as input to our clustering approach. These events were obtained using a text mining workflow developed using the Argo workbench [18] and employed in the work carried out by Zerva et al. [25]. Based on the processing of 6,529 full- text papers relevant to melanoma1 , a total of 3,168 events were extracted. Out of these, 293 pertained to interactions between drugs (which we will henceforth refer to as drug-drug interactions or DDIs) while 2,875 are drug-protein interactions (DPIs). A list of 831 drugs was formed upon combining all unique drugs in the DDI and DPI sets. For each drug in the DPI set, we also created a list of drug targets. 1 Retrieved by calling the Europe PubMed Central API with a query containing “melanoma”, its synonyms and names of melanoma cell lines Fig. 1. Our processing pipeline for clustering drugs. 3.2 Drug feature extraction ChEBI is an ontology that catalogues chemical structures and biological roles of drugs. For each drug of interest in our list, we obtained related terms from the “chemical structure” and “biological role” sub-trees of the ontology. Struc- tural information among the tree terms was discarded and only terms appearing between the drug and three levels up were stored, corresponding to individual drugs2 . 3.3 Drug target feature extraction The Gene Ontology (GO) contains information on cellular components and molecular functions of genes, as well as the biological processes they are involved in. For each of the proteins that a particular drug interacts with—according to our list of DPIs—we obtained a list of gene products from GO. For each gene product, we obtained all related terms in GO up to a tree depth of three levels up. The result is a combined list of GO terms associated with a particular drug. A depiction of the different types of features extracted for any given drug is shown in Figure 2. 3.4 Drug similarity measurement Based on the features obtained in the steps described above, we measure the similarity between drugs. Specifically, given any two drugs, two types of similarity scores were calculated, i.e., the number of shared ChEBI terms and the number of shared GO terms. From a total of 423,801 drug pairs, there were 92,700 and 208,499 pairs that have a non-zero number of shared ChEBI and GO terms, respectively. The similarity score (i.e., the number of shared terms) between a pair of drugs based on ChEBI terms varied from 1 to 49 while that based on shared GO terms ranged from 1 to 8,526. In order to normalise them, we divided each score by the respective maximum value observed, and then multiplied the result by 100 to get a percentage. 2 The number of levels was constrained to three to ensure that succeeding steps will be computationally feasible. Fig. 2. Different types of features extracted from ontologies for any given drug. There are however, many drug pairs whose similarity scores were very low, indicating insignificant similarity between drugs. To overcome this problem, we retained only drug pairs whose ChEBI and GO similarity scores are both at least 10%. The threshold was set to this level based on our analysis of the data distribution which informed our decision on an optimal cut off. A combined similarity score is finally calculated by taking the mean of the two scores. If only one of the scores exists for a drug pair (e.g., in cases where a score was obtained by counting the number of shared ChEBI terms but none based on shared GO terms, or vice-versa), we take half of the value of the lone score as the combined similarity but only if it is at least 50%. Otherwise, the pair is discarded. After this step, only 261 drug pairs remained, having 89 unique drugs. The remaining drug pairs were combined to form an interconnected graph following the Yifan Hu representation [10]. In this graph representation, nodes represent drugs and weighted edges between nodes correspond to their similarity score. It is found that approximately around 20% of the nodes in the graph were not connected to the larger central interconnected graph. We consider them as pertaining to drugs which are extremely unique in terms of their MOA, or irrel- evant to cancer but were spuriously included during the information extraction step. These nodes were therefore filtered out, retaining only 72 nodes. 3.5 Drug clustering and ranking We investigated two algorithms for clustering the graph of drugs, namely, Chi- nese Whispers (CW) [4] and Louvain Method (LM) [5]. CW assigns a class label to each node based on the strongest weight of the neighbouring class, in itera- tive cycles. Meanwhile, LM optimises graph modularity, which is a measure of separation between densely connected regions of the graph. All drugs in the graph were then ranked with respect to their prominence3 in the whole graph according to: (1) degree, the number of immediate neighbors; 3 Pertains to importance based on their centrality within the graph (2) closeness, the distance between each node and all other nodes in the graph; (3) betweenness, the number of times a node is traversed along the shortest distance between each node and all other nodes in the graph; (4) PageRank, the probability distribution based on the likelihood of stopping on the node while traversing connected nodes; and (5) eigenvector, the principal eigenvector in the adjacency matrix of the graph. Additionally, drugs were also ranked within each cluster using the same centrality measures. 4 Experiments and Results Upon the application of Chinese Whispers and Louvain Method on our data, we obtained the results shown in Figures 3 and 4, respectively, following the Yifan Hu graph representation. For simplicity, each cluster is assigned a cluster identifier and represented by a unique colour. Its distribution, i.e., the proportion of the nodes in the graph that belongs to the cluster, is also indicated. Cluster ID Color Distribution CW C1 41.67% CW C2 27.78% CW C3 20.83% CW C4 9.72% Fig. 3. Result of drug graph clustering based on the Chinese Whispers (CW) algorithm. Cluster ID Color Distribution LM C1 40.28% LM C2 29.17% LM C3 20.83% LM C4 9.72% Fig. 4. Result of drug graph clustering based on the Louvain Method (LM) algorithm. Table 1. Ten most prominent drugs with respect to the entire graph, according to five centrality measures. Rank Degree Closeness Betweenness PageRank Eigenvec- tor 1 L-threonine L-gamma-glutamyl-L- L-threonine L-threonine sirolimus cysteine 2 sorafenib linifanib sirolimus Sorafenib L- threonine 3 sirolimus PD-153035 sorafenib L-serine sorafenib 4 curcumin ibrutinib curcumin curcumin curcumin 4 curcumin ibrutinib curcumin curcumin curcumin 5 L-serine ceritinib L-serine Sirolimus selume- tinib 6 vemurafenib GSK690693 tandutinib tandutinib vemu- rafenib 7 Glu-Phe-Val desmosine L- selumetinib L-serine phenylalanine 8 selumetinib L-seryl group selumetinib Glu-Phe-Val trametinib 9 Phe-Asn L-tyrosine ombitasvir vemurafenib PLX-4720 10 Thr-Ser Glu-Met trametinib Phe-Asn borte- zomib In both cases, the graph of 72 drugs was divided into four clusters of varying sizes. It is noticeable that similar clusters were produced by the two algorithms. We can consider the following as equivalent to each other: CW C1 and LM C1, CW C2 and LM C2, CW C3 and LM C3, and CW C4 and LM C4. While each of the first two cluster pairs has only a one-node difference, the latter two pairs correspond to exactly the same clusters. Table 1 presents the most prominent drugs within the entire graph while Table 2 provides a list of the same but only within each cluster. In both tables, drugs are ranked according to prominence, calculated based on our chosen graph centrality measures, as mentioned in Section 3.5. The three highest ranked drugs across the different centrality parameters are exactly the same corresponding to clusters from CW and LM except for one difference, shown in Table 2. 5 Analysis and Discussion The thresholds which were applied to ensure that only drugs with considerable similarity were retained—described in Section 3—significantly reduced the num- ber of drugs which formed the graph. Out of 831 drugs in our initial list, only 72 were included in the graph for clustering. As both CW and LM clustering algorithms are non-deterministic, they were applied on the data several times; results were consistent over the different runs. To assess the quality of our clustering results, we firstly computed the value of the silhouette coefficient [21], a widely accepted metric for judging the quality Table 2. Three most prominent drugs in each cluster obtained by Chinese Whispers and Louvain Method, according to five centrality measures. The cell corresponding to third rank of eigenvector-C2 is left empty as different results were obtained by CW and LM, i.e., ‘L-threonine residue’ and ‘bortezomib’, respectively. Clus- Degree Closeness Betweenness PageRank Eigenvector ter ID sorafenib (-)-cubebin sirolimus sorafenib sirolimus C1 sirolimus regorafenib sorafenib curcumin sorafenib curcumin cabozantinib curcumin sirolimus curcumin L- desmosine L-threonine L-threonine L-threonine threonine C2 L-serine L-seryl group L-serine L-serine L-serine L-pheny L-tyrosine L-phenylalanine L-phenylalanine - lalanine Glu-Phe- L-gamma-glut ombitasvir Glu-Phe-Val Thr-Ser Val amyl-L-cysteine C3 Phe-Asn Glu-Met Thr-Trp Phe-Asn Thr-Trp Thr-Ser Ala-Asp-Pro Thr-Ser Thr-Ser Glu-Phe-Val tandutinib PD-153035 tandutinib tandutinib tandutinib C4 afatinib linifanib afatinib afatinib afatinib ibrutinib ibrutinib ibrutinib GSK690693 ibrutinib of clusters where class labels are not predefined. Its value varies from -1 to 1, where positive values are taken to mean good clustering while anything less than zero is undesirable. Results obtained by CW and LM produced 0.294 and 0.292 as mean silhouette coefficient values, respectively. The cluster consistency is not very strong, but it is still substantial and falls within the acceptable range [12]. CW and LM use different techniques in generating clusters. While CW re- lies on random class distribution, LM is based on optimisation of modularity. We computed the value of Adjusted Rand Index (ARI), a measure of similarity between two clusterings, taking into account by-chance grouping [19]. The in- dex can take a value that varies from -1 to 1, where -1 corresponds to lack of correlation while 1 pertains to a perfect match. We obtained 0.95 as the value of ARI for the clusterings produced by CW and LM, which is very high. This strengthens our confidence in the results, as it suggests that the drug clusters are consistent even across different algorithms. Furthermore, our results were reviewed by a domain expert who was con- vinced by the results and suggested that it is worth pursuing further research into highlighting the dominant features for each cluster in order to uncover relevant mechanisms of action. He expressed confidence in the viability of the research and signified that it holds great potential. 6 Conclusion and Future Work In this paper, we proposed a novel approach to highlighting similarities between drugs, according to features derived from scientific literature and ontologies. Two unsupervised clustering methods were employed, namely, Chinese Whispers and Louvain Method. Upon the application of our approach to 72 drugs, the same four clusters were independently produced by each of the clustering methods. We then obtained a list of the most prominent drugs within the entire graph of drugs as well as within each cluster. The approach is a preliminary investigation and leaves room for improvement and future work. The most important next step is the experimental validation of the dominant MOA represented by each cluster using sources such as annotated data sets or manual annotation by domain experts. Other methods that could potentially improve the results include hierarchical clustering to explore sub- cluster relationships or soft clustering to take into account membership of a drug in multiple clusters. Feature engineering can also be extended, e.g., to increase tree depth when extracting terms from ontologies. Acknowledgment. The authors would like to express their gratitude to the National Centre for Text Mining (NaCTeM) for sharing their event extraction results. The authors also thank Prof. Ross King for his help in evaluating the clustering results. References [1] M.P. Adams, C.Q. Urban, Pharmacology: Connections to Nurs- ing Practice (Pearson Education, UK, 2015). ISBN 9780133896817. https://books.google.co.uk/books?id=usOgBwAAQBAJ [2] A. Ahmad, N. Ormiston-Smith, P. Sasieni, Trends in the lifetime risk of developing cancer in great britain: comparison of risk for those born from 1930 to 1960. British journal of cancer 112(5), 943 (2015) [3] A.I. Archakov, V.M. Govorun, A.V. Dubanov, Y.D. Ivanov, A.V. Veselovsky, P. Lewi, P. Janssen, Protein-protein interactions as a target for drugs in proteomics. Proteomics 3(4), 380–391 (2003) [4] C. Biemann, Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems, in Proceedings of the first workshop on graph based methods for natural language processing, Association for Computational Linguistics, 2006, pp. 73–80. Association for Computational Linguistics [5] V.D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of com- munities in large networks. Journal of statistical mechanics: theory and experiment 2008(10), 10008 (2008) [6] M. Cascante, L.G. Boros, B. Comin-Anduix, P. de Atauri, J.J. Centelles, P.W.-N. Lee, Metabolic control analysis in drug discovery and disease. Nature biotechnol- ogy 20(3), 243–249 (2002) [7] G.O. Consortium, et al., The gene ontology project in 2008. Nucleic acids research 36(suppl 1), 440–444 (2008) [8] K. Degtyarenko, P. De Matos, M. Ennis, J. Hastings, M. Zbinden, A. McNaught, R. Alcántara, M. Darsow, M. Guedj, M. Ashburner, Chebi: a database and ontol- ogy for chemical entities of biological interest. Nucleic acids research 36(suppl 1), 344–350 (2007) [9] A. Gemma, C. Li, Y. Sugiyama, K. Matsuda, Y. Seike, S. Kosaihira, Y. Minegishi, R. Noro, M. Nara, M. Seike, et al., Anticancer drug clustering in lung cancer based on gene expression profiles and sensitivity database. BMC cancer 6(1), 174 (2006) [10] Y. Hu, Efficient, high-quality force-directed graph drawing. Mathematica Journal 10(1), 37–71 (2005) [11] J. Jeon, J.H. Jeong, J.-H. Baek, H.-J. Koo, W.-H. Park, J.-S. Yang, M.-H. Yu, S. Kim, Y.K. Pak, Network clustering revealed the systemic alterations of mitochon- drial protein expression. PLoS computational biology 7(6), 1002093 (2011) [12] L. Kaufman, P.J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, vol. 344 (John Wiley & Sons, ???, 2009) [13] I. Kola, J. Landis, Opinion: Can the pharmaceutical industry reduce attrition rates? Nature reviews. Drug discovery 3(8), 711 (2004) [14] D. Lu, T. Lu, M. Stroh, R.A. Graham, P. Agarwal, L. Musib, C.-C. Li, B.L. Lum, A. Joshi, A survey of new oncology drug approvals in the usa from 2010 to 2015: a focus on optimal dose and related postmarketing activities. Cancer chemotherapy and pharmacology 77(3), 459–476 (2016) [15] C. McQueen, Comprehensive Toxicology (Elsevier Science, ???, 2010). ISBN 9780080468846. https://books.google.co.uk/books?id=jzCAKsa2CpMC [16] M. Newman, Networks:An Introduction (OUP Oxford, UK, 2009). ISBN 9780191637766. https://books.google.co.uk/books?id=7LmNAQAACAAJ [17] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web., Technical report, Stanford InfoLab, 1999 [18] R. Rak, A. Rowley, W. Black, S. Ananiadou, Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012 (2012) [19] W.M. Rand, Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66(336), 846–850 (1971) [20] D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, C. Rees, P. Spellman, V. Iyer, S.S. Jeffrey, M. Van de Rijn, M. Waltham, et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature genetics 24(3), 227 (2000) [21] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, 53–65 (1987) [22] L. Udrescu, L. Sbârcea, A. Topı̂rceanu, A. Iovanovici, L. Kurunczi, P. Bogdan, M. Udrescu, Clustering drug-drug interaction networks with energy model layouts: community analysis and drug repurposing. Scientific reports 6 (2016) [23] K. Uhr, W.J. Prager-van der Smissen, A.A. Heine, B. Ozturk, M. Smid, H.W. Göhlmann, A. Jager, J.A. Foekens, J.W. Martens, Understanding drugs in breast cancer through drug sensitivity screening. SpringerPlus 4(1), 611 (2015) [24] C.R. UK, Cancer Statistics for the UK. Accessed: 2017-10-05. http://www.cancerresearchuk.org/health-professional/cancer-statistics-for- the-uk [25] C. Zerva, R. Batista-Navarro, P. Day, S. Ananiadou, Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics, 466 (2017)