ChemoVerse: Manifold traversal of latent spaces for novel molecule discovery Harshdeep Singh1∗ , Nicholas McCarthy2 , Qurrat Ul Ain3 , Jeremiah Hayes2 1 Swiss Federal Institute of Technology, Lausanne, 2 Accenture Labs, Dublin 3 AI Innovation Lab, DSAI, Novartis harshdeep.harshdeep@epfl.ch, {nicholas.mccarthy,jer.hayes}@accenture.com, qurrat.ul ain@novartis.com Abstract embed the chemical properties of such into a vector space. By decoding from this ’latent’ space of chemical structure we can In order to design a more potent and effective generate new, previously unidentified chemical compounds. chemical entity, it is essential to identify molecu- In the domain of computational chemistry, Simplified lar structures with the desired chemical properties. molecular-input line-entry system or SMILES are a com- Recent advances in generative models using neural mon string textual method for encoding and representation of networks and machine learning are being widely molecular structures [Anderson et al., 1987]. This facilitates used by many emerging startups and researchers the use of models more commonly used in natural language in this domain to design virtual libraries of drug- processing. Therefore, SMILES strings have been used as like compounds. Although these models can help raw input strings to generative models, which are given the a scientist to produce novel molecular structures task of encoding and decoding the SMILES string directly rapidly, the challenge still exists in the intelligent [Anderson et al., 1987]. Advances were made by employing exploration of the latent spaces of generative mod- variational autoencoders (VAE), a neural network comprised els, thereby reducing the randomness in the gen- of an encoder that transforms a compound’s representation erative procedure. In this work we present a mani- into a compressed latent space, and a decoder that gener- fold traversal with heuristic search to explore the la- ates compounds from the latent space [Kingma and Welling, tent chemical space. Different heuristics and scores 2013]. Although, directed search of the resulting latent space such as the Tanimoto coefficient, synthetic accessi- is difficult. To counter this, conditional variational autoen- bility, binding activity, and QED drug-likeness can coders (CVAE)[Kang and Cho, 2018] were used in order be incorporated to increase the validity and proxim- to facilitate the generation of new molecules with specified ity for desired molecular properties of the generated molecular properties. This is achieved by incorporating the molecules. For evaluating the manifold traversal molecular properties of a compound into the encoder layer exploration, we produce the latent chemical space and helping in the generation of more drug-like molecules using various generative models such as grammar [Lim et al., 2018]. Generative adversarial networks (GANs) variational autoencoders (with and without atten- have also been applied in the same manner [Maziarka et al., tion) as they deal with the randomized generation 2020], and have been recently combined with reinforcement and validity of compounds. With this novel traver- learning and graph representation of molecules to optimize sal method, we are able to find more unseen com- the generation of molecules with specified molecular proper- pounds and more specific regions to mine in the la- ties [De Cao and Kipf, 2018]. tent space. Finally, these components are brought together in a simple platform allowing users to per- Discovery in the latent space generated by these models is form search, visualization and selection of novel often performed using random sampling and linear interpola- generated compounds. tion, primarily due to the ease of the implementation of these methods. However, this is not suitable for most generative models as their latent spaces are generally high dimensional 1 Introduction and sparse. While doing traversal, we will traverse regions where the data is not very well represented. In other words, Designing a new chemical entity is a time consuming, expen- this could lead to a ’dead zone’ as the space of molecular sive, and error-prone task. Pharmaceutical companies invest samples in the training dataset are present only on a subset of billions of dollars into screening vast libraries of chemical the latent space [White, 2016]. Hence, decoding a point from compounds for hit and lead identification [Fleming, 2018]. the latent space will end up returning noisy or invalid results. The past few years has seen the rise of deep generative models It can also be challenging to incorporate contextual domain that can operate over large spaces of molecular structures and information during search, and as a result discovery of com- ∗ This work was performed during an internship programme at pounds with specific properties is often very inconsistent. Accenture Labs, Dublin, Ireland. In this work we implemented various flavours of auto- Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) Figure 1: Visualization of the latent space as a K-dimensional tree, while traversing the latent manifold of compounds used in the treatment of diabetes and lung cancer. encoders as the generative model for producing sets of latent ther nine modified in order to represent the more complex spaces, and in particular show that our novel implementation ChEMBL dataset. of Grammar VAE [Kusner et al., 2017] with an additional at- tention mechanism [Vaswani et al., 2017] is highly performa- 2.2 Latent Space Generation tive, with a low rate of invalid molecules generated. We also Three models are implemented in our system: a VAE introduce a novel manifold interpolation method employing [Kingma and Welling, 2013], a Grammar VAE [Kusner et al., the Riemannian metric [Arvanitidis et al., 2017] in conjunc- 2017], and a Grammar VAE with self-attention [Vaswani et tion with a set of molecular property heuristics to perform al., 2017]. As our search algorithm is model agnostic, latent directed search and interpolation of these latent spaces in or- spaces can be substituted with minimal effort. However, re- der to design novel molecules with desired properties. This sults will vary depending on the underlying data, model archi- combination of generation and exploration of latent space has tecture, and training parameters used to generate each latent enabled us to not only design molecules which have not been space. seen before but also to explore new regions of latent chemical The ChEMBL dataset is less standardized and contains space where more potent chemical compounds may exist. more complex molecules, therefore we perform transfer learning by initially training each model on the ZINC dataset 2 System Architecture for 50 epochs, then switching to the ChEMBL dataset for 50 epochs. Training, validation and test sets of 85%, 10%, and In this section we describe the various components of our 5% respectively was used, with the test set comprised entirely system architecture: generation of latent spaces and our al- of ChEMBL molecules. We use the Adam optimizer with gorithm for manifold traversal. a learning rate scheduler that is instantiated after 15 epochs with a factor of 0.1 (initialized at 0.001). 2.1 Data The encoder is comprised of three 1D convolutional layers We used a dataset of 250,000 molecules drawn from the ZINC with filters of size 9, 10 and 11 respectively, while the de- dataset [Irwin et al., 2012], and an additional 100,000 drawn coder is comprised of 3 gated recurrent units (GRU) of 501 from the ChEMBL dataset [Gaulton et al., 2017]. These units [Kusner et al., 2017]. The structural validity of the gen- two datasets are comprised of commercially available drug erated compounds are checked using the open-source RDKit molecules and have been used in related work using models library [Landrum, ]. Examples of test set compounds that like variational autoencoders (VAEs) [Gómez-Bombarelli et have been encoded and decoded are shown in table 1. al., 2018]. Molecules are represented in canonical SMILES As noted in the original Grammar VAE work [Kusner et string format, and are further processed into 1) a one-hot char- al., 2017] the vanilla VAE architectures encoding SMILES acter encoding and 2) a set of context-free grammar (CFG) strings directly generally produce a very low valid decode rules. Grammar rules are obtained from the OpenSMILES rate on larger molecular datasets - just 17% using a condi- specification [James and Dalke, 2016], which denotes how tional VAE under their bayesian optimization search method- the SMILES representation was formed based on the rules. ology. By instead using a Grammar VAE to generate produc- This context free grammar (CFG) consists of 76 production tion rules of a grammar instead of SMILES strings directly a rules, to which an additional seven are added, and a fur- much higher rate of valid compounds was attained. However, Figure 2: Visualization of the path in the latent space along which our algorithm is interpolating. On the left are generated compounds not present in the training data. On the right are histograms of different molecular properties, including synthetic accessibility, activity scores. Algorithm 1 Interpolation with manifold traversal sume that the latent space is Euclidean and flattened out, and Input: latent space L, source and destination points s and d, generally produce noisier results [Arvanitidis et al., 2017]. points of interest R, path length m In our algorithm, source and destination points are selected Parameter: k heuristics and associated edge weights W k in the latent space; these can either be latent space encod- Output: a path of m equidistant points from s to d ing of single molecules, or cluster centroids of molecules la- for each reference point i in R do beled with a desired property. Points of interest are the set Find n nearest neighbors of i → Ni of known molecules with desired properties, for example all for each (i, j), where j ∈ Ni do molecules used in treatment of a specific condition. The goal 1. Calculate Jacobian matrix Jij of decoder output (dec) is to define a path from source to destination points in the w.r.t to latent space (L) ∂dec(L ∂Lij i j) latent space through regions of interest, factoring in any addi- k 2. Obtain k heuristic distances Hij between the 2 points tional user-specified heuristics such as synthetic accessibility, 3. Determine path edge weight based on Jacobian and binding activity, or drug-likeness that will augment generated PN P k k heuristic: Pij ← ij Jij + k Hij · W and store in a molecules. k-d tree. Interpolation is performed by first calculating the Jacobian end for distances for all points of interest. This helps us to understand Apply a path finding algorithm such as Yen’s algorithm (with A*) how much each latent space point differs from another based to determine a path from s to d on the k-d tree. on the representation learnt by the model, and understand the Segment the path using m equidistant points. stretching and rotational transformations of the local neigh- return P ← (s, p1 , .., pm−2 , d) borhood of each point with respect to other points. A k-dimensional tree is built using the resulting Jacobian as OpenSMILES is a context-free instead of regular gram- distances as edge weights between compounds in the points mar it is still unable to model certain subtle characteristics of of interest. Therefore, placing compounds with greater struc- the SMILES grammar such as paired ring bonds; for example tural similarity in closer proximity on the tree. A k-d tree is the SMILES string ’c1ccccc1C2CCCC2’ would be decoded chosen since it divides the domain of search into half at each as ’c1ccccc1C2CCCC’, incorrectly dropping the final paired level. Hence search for a node in the tree can be done in log- digit. By incorporating a self-attention layer in the Grammar arithmic time and makes the data structure run time efficient. VAE architecture this effect was mitigated, and increased the Edges can also be weighted by user-specified domain heuris- validity of decoded test set molecules from 61% to 70%. tics, adding a weighted cost to augment the paths produced to generate molecules more relevant for a specific target. These 2.3 Manifold Traversal of Latent Space heuristics are listed below: Once a latent space has been generated we can apply our man- • Fingerprint Similarity: A fingerprint is a series of bi- ifold traversal algorithm to generate interpolative paths in the nary digits (bits) that represent the presence or absence latent space in order to decode molecules with desired prop- of particular substructures in the molecule. The simi- erties. A common approach here is to use linear or spheri- larity can be tested using cosine or Tanimoto distance cal interpolation [White, 2016], however both approaches as- metrics. The Tanimoto similarity takes into account the Table 1: Input and corresponding decoded SMILES strings of test set compounds generated by the model. Tanimoto similarity is used to measure the the structural similarity. An observation that can be made here is that small changes in the generated structure w.r.t to actual structure might lead to disproportionate changes in the similarity. Actual Compound Generated compound Similarity OC(=O)CSc1oc2ccccc2n1 OC(=O)CCS1cc2ccccs2n1 0.141 CC(=O)C(=Cc1ccccc1)c2ccccc2 O.CCCN(Cc1ccccc1c2)cccnn2 0.171 Cc1occc1C(=O)NCc2onc(n2)c3ccccc3 Nc1nnc1NC(=O)NCSc2nn(s2)c3ccccc3 0.221 Cc1ccc(SCC(=O)NNC(=O)c2cncc(Br)c2)cc1 Clc1ccc(SCC(=O)NC(=O)c2cncc(Br)c2cc1)CF 0.373 CCc1ccc(cc1)N(CC(=O)N2CCOCC2)S(=O)(=O)C CSc1ccc(cc1N(C(=O)Oc2CCCCC2S(=O)(=O)))ON 0.235 Table 2: Samples of the compounds generated while interpolating on the latent space path as shown in Figure 2. These compounds are not present in the train/test datasets, and we assign a potential target label based on labels of their nearest neighbours in the test set. SAS score indicates if a molecule is difficult to synthesize. According to the Lipinski’s rule of five [Benet et al., 2016], the molecular mass for an active drug should be less than 500 daltons. Activity range demonstrates the binding activity of a compound: the greater the range more active it is. Generated compound Activity range SAS Molecular Potential Label Score weight CS1(C2)CC(C(C3C(C))C)[C@]1n1C2n1C3C4CC4C Less than 5 6.186 299.89 DIABETES CCC1(C)C(C)CCCCS1C(C1(O([C@+2](C1))))S Between 5 and 7 6.251 274.49 DIABETES CSCCCC2(C=C(O))S(C=O)C2C1[C@@](C1(C))CC Between 5 and 7 5.990 301.49 DIABETES CC1(N(S))NN1CCCNc1ncnc2CCn2c1CC Greater than 7 5.233 299.89 LUNG CANCER CS1(C2)CC(C(C3C(C))C)[C@]1n1C2n1C3C4CC4Cl Less than 5 6.186 309.44 LUNG CANCER CCCC1CCC(S)N(C1)C(O)NCC Less than 5 4.355 232.93 DIABETES structural properties whereas cosine does not. decoded along the path of centroids generated just 3 com- pounds with valid structures. On the contrary, applying man- • Synthetic Accessibility (SA Score): A molecule syn- ifold traversal with fingerprint similarity and Yen’s algorithm thetic accessibility is a score which is between 1 (easy to as heuristic and perturbing the source and destination points produce) and 10 (very difficult to produce). This is cal- produced 4 different paths. These four paths generated a total culated based on fragment contributions and molecule of 156 valid, novel compounds along the interpolated mani- complexity. The absolute difference between two fold between the latent regions of diabetes and lung cancer molecules is taken into account. labeled molecules. Samples of these generated compounds • Drug-likeliness: This is a score which takes into account can be seen in Table 2. Specific regions to mine within the if the molecule is ’drug-like’. This is evaluated using latent space can be found by plotting different paths, bound several parameters such as molecular weight, solubility them and finding the overlap region where compounds with in water or lipophilic efficiency. The absolute difference the right structure and specific characteristics can be found. between two molecules is taken into account. • Binding activity: This demonstrates the potency to a tar- 3 Conclusions and Future Work get for a potential drug compound; less than 5 is consid- In this work we presented a model-agnostic platform for per- ered inactive, 5-7 of intermediate activity, greater than 7 forming manifold traversal on latent spaces with user speci- active. fied domain heuristics. This interpolation method allows us Yen’s algorithm [Yen, 1971] in combination with the A* to add more context and direction to search and discovery of algorithm is then applied on the k-d tree to find the short- molecules in the latent space. Methods for exploration of la- est path from source to destination given the user constraints. tent spaces generated from datasets of millions of molecules Once the shortest path is found, we interpolate along this path provide an extremely valuable tool for virtual drug screening, equidistantly and decode the points on the latent space using and an ability to facilitate rapid drug discovery. the generator to generate compounds. Multiple paths can be Some avenues for future work in this domain include: im- found by taking into account the shortest path and either per- plementation of alternative models to produce latent-spaces turbing it or by changing the number of interpolation points of various characteristics; more sophisticated methods for between the source and the destination, and intuitively this curve fitting in high dimensional spaces such as Bézier curves increases the overall number of novel generated compounds. and Gaussian regression; alternative search methods such as using evolutionary and genetic algorithms on the latent space. Manifold traversal is inherently more useful than linear or Future work would also focus on implementing latent space spherical interpolation as it gives users greater flexibility in evaluation metrics using this interpolation method to under- path exploration under various conditions, and empirically stand the underlying aspects of these spaces. demonstrates a much higher rate of valid decoded molecules. For example, when considering the diabetes and lung can- cer centroids; linear interpolation with 100 equidistant points References [Lim et al., 2018] Jaechang Lim, Seongok Ryu, Jin Woo [Anderson et al., 1987] E. Kim, and Woo Youn Kim. Molecular generative model Anderson, G.D. Veith, based on conditional variational autoencoder for de novo D. Weininger, and Minn.) Environmental Research molecular design. Journal of cheminformatics, 10(1):1–9, Laboratory (Duluth. SMILES, a Line Notation and 2018. Computerized Interpreter for Chemical Structures. Envi- ronmental research brief. U.S. Environmental Protection [Maziarka et al., 2020] Łukasz Maziarka, Agnieszka Pocha, Agency, Environmental Research Laboratory, 1987. Jan Kaczmarczyk, Krzysztof Rataj, Tomasz Danel, and Michał Warchoł. Mol-cyclegan: a generative model for [Arvanitidis et al., 2017] Georgios Arvanitidis, Lars Kai molecular optimization. Journal of Cheminformatics, Hansen, and Søren Hauberg. Latent space oddity: on 12(1):1–18, 2020. the curvature of deep generative models. arXiv preprint arXiv:1710.11379, 2017. [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, [Benet et al., 2016] Leslie Z Benet, Chelsea M Hosey, Oleg Łukasz Kaiser, and Illia Polosukhin. Attention is all you Ursu, and Tudor I Oprea. Bddcs, the rule of 5 and druga- need. In Advances in neural information processing sys- bility. Advanced drug delivery reviews, 101:89–98, 2016. tems, pages 5998–6008, 2017. [De Cao and Kipf, 2018] Nicola De Cao and Thomas Kipf. [White, 2016] Tom White. Sampling generative networks. Molgan: An implicit generative model for small molecular arXiv preprint arXiv:1609.04468, 2016. graphs. arXiv preprint arXiv:1805.11973, 2018. [Yen, 1971] Jin Y Yen. Finding the k shortest loopless [Fleming, 2018] Nic Fleming. How artificial intelligence is paths in a network. management Science, 17(11):712–716, changing drug discovery. Nature, 557(1):S55–S57, 2018. 1971. [Gaulton et al., 2017] Anna Gaulton, Anne Hersey, Michał Nowotka, A Patrı́cia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J Bellis, Elena Cibrián-Uhalte, et al. The chembl database in 2017. Nucleic acids research, 45(D1):D945–D954, 2017. [Gómez-Bombarelli et al., 2018] Rafael Gómez- Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamı́n Sánchez- Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru- Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268–276, 2018. [Irwin et al., 2012] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012. [James and Dalke, 2016] Vander-meersch T James, Craig A and A. Dalke. Opensmiles specification, 2016. [Kang and Cho, 2018] Seokho Kang and Kyunghyun Cho. Conditional molecular design with deep generative mod- els. Journal of chemical information and modeling, 59(1):43–52, 2018. [Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [Kusner et al., 2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational au- toencoder. In Proceedings of the 34th International Con- ference on Machine Learning-Volume 70, pages 1945– 1954. JMLR. org, 2017. [Landrum, ] Greg Landrum. Rdkit: Open-source cheminfor- matics.