ChemoVerse: Manifold traversal of latent spaces for novel molecule discovery

             Harshdeep Singh1∗ , Nicholas McCarthy2 , Qurrat Ul Ain3 , Jeremiah Hayes2
   1
     Swiss Federal Institute of Technology, Lausanne, 2 Accenture Labs, Dublin 3 AI Innovation Lab, DSAI,
                                                   Novartis
               harshdeep.harshdeep@epfl.ch, {nicholas.mccarthy,jer.hayes}@accenture.com,
                                         qurrat.ul ain@novartis.com


                            Abstract                                  embed the chemical properties of such into a vector space. By
                                                                      decoding from this ’latent’ space of chemical structure we can
       In order to design a more potent and effective                 generate new, previously unidentified chemical compounds.
       chemical entity, it is essential to identify molecu-              In the domain of computational chemistry, Simplified
       lar structures with the desired chemical properties.           molecular-input line-entry system or SMILES are a com-
       Recent advances in generative models using neural              mon string textual method for encoding and representation of
       networks and machine learning are being widely                 molecular structures [Anderson et al., 1987]. This facilitates
       used by many emerging startups and researchers                 the use of models more commonly used in natural language
       in this domain to design virtual libraries of drug-            processing. Therefore, SMILES strings have been used as
       like compounds. Although these models can help                 raw input strings to generative models, which are given the
       a scientist to produce novel molecular structures              task of encoding and decoding the SMILES string directly
       rapidly, the challenge still exists in the intelligent         [Anderson et al., 1987]. Advances were made by employing
       exploration of the latent spaces of generative mod-            variational autoencoders (VAE), a neural network comprised
       els, thereby reducing the randomness in the gen-               of an encoder that transforms a compound’s representation
       erative procedure. In this work we present a mani-             into a compressed latent space, and a decoder that gener-
       fold traversal with heuristic search to explore the la-        ates compounds from the latent space [Kingma and Welling,
       tent chemical space. Different heuristics and scores           2013]. Although, directed search of the resulting latent space
       such as the Tanimoto coefficient, synthetic accessi-           is difficult. To counter this, conditional variational autoen-
       bility, binding activity, and QED drug-likeness can            coders (CVAE)[Kang and Cho, 2018] were used in order
       be incorporated to increase the validity and proxim-           to facilitate the generation of new molecules with specified
       ity for desired molecular properties of the generated          molecular properties. This is achieved by incorporating the
       molecules. For evaluating the manifold traversal               molecular properties of a compound into the encoder layer
       exploration, we produce the latent chemical space              and helping in the generation of more drug-like molecules
       using various generative models such as grammar                [Lim et al., 2018]. Generative adversarial networks (GANs)
       variational autoencoders (with and without atten-              have also been applied in the same manner [Maziarka et al.,
       tion) as they deal with the randomized generation              2020], and have been recently combined with reinforcement
       and validity of compounds. With this novel traver-             learning and graph representation of molecules to optimize
       sal method, we are able to find more unseen com-               the generation of molecules with specified molecular proper-
       pounds and more specific regions to mine in the la-            ties [De Cao and Kipf, 2018].
       tent space. Finally, these components are brought
       together in a simple platform allowing users to per-              Discovery in the latent space generated by these models is
       form search, visualization and selection of novel              often performed using random sampling and linear interpola-
       generated compounds.                                           tion, primarily due to the ease of the implementation of these
                                                                      methods. However, this is not suitable for most generative
                                                                      models as their latent spaces are generally high dimensional
 1    Introduction                                                    and sparse. While doing traversal, we will traverse regions
                                                                      where the data is not very well represented. In other words,
 Designing a new chemical entity is a time consuming, expen-          this could lead to a ’dead zone’ as the space of molecular
 sive, and error-prone task. Pharmaceutical companies invest          samples in the training dataset are present only on a subset of
 billions of dollars into screening vast libraries of chemical        the latent space [White, 2016]. Hence, decoding a point from
 compounds for hit and lead identification [Fleming, 2018].           the latent space will end up returning noisy or invalid results.
 The past few years has seen the rise of deep generative models       It can also be challenging to incorporate contextual domain
 that can operate over large spaces of molecular structures and       information during search, and as a result discovery of com-
   ∗
     This work was performed during an internship programme at        pounds with specific properties is often very inconsistent.
 Accenture Labs, Dublin, Ireland.                                        In this work we implemented various flavours of auto-

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
                Figure 1: Visualization of the latent space as a K-dimensional tree, while traversing the latent manifold
                of compounds used in the treatment of diabetes and lung cancer.


encoders as the generative model for producing sets of latent          ther nine modified in order to represent the more complex
spaces, and in particular show that our novel implementation           ChEMBL dataset.
of Grammar VAE [Kusner et al., 2017] with an additional at-
tention mechanism [Vaswani et al., 2017] is highly performa-           2.2   Latent Space Generation
tive, with a low rate of invalid molecules generated. We also          Three models are implemented in our system: a VAE
introduce a novel manifold interpolation method employing              [Kingma and Welling, 2013], a Grammar VAE [Kusner et al.,
the Riemannian metric [Arvanitidis et al., 2017] in conjunc-           2017], and a Grammar VAE with self-attention [Vaswani et
tion with a set of molecular property heuristics to perform            al., 2017]. As our search algorithm is model agnostic, latent
directed search and interpolation of these latent spaces in or-        spaces can be substituted with minimal effort. However, re-
der to design novel molecules with desired properties. This            sults will vary depending on the underlying data, model archi-
combination of generation and exploration of latent space has          tecture, and training parameters used to generate each latent
enabled us to not only design molecules which have not been            space.
seen before but also to explore new regions of latent chemical            The ChEMBL dataset is less standardized and contains
space where more potent chemical compounds may exist.                  more complex molecules, therefore we perform transfer
                                                                       learning by initially training each model on the ZINC dataset
2     System Architecture                                              for 50 epochs, then switching to the ChEMBL dataset for 50
                                                                       epochs. Training, validation and test sets of 85%, 10%, and
In this section we describe the various components of our              5% respectively was used, with the test set comprised entirely
system architecture: generation of latent spaces and our al-           of ChEMBL molecules. We use the Adam optimizer with
gorithm for manifold traversal.                                        a learning rate scheduler that is instantiated after 15 epochs
                                                                       with a factor of 0.1 (initialized at 0.001).
2.1    Data                                                               The encoder is comprised of three 1D convolutional layers
We used a dataset of 250,000 molecules drawn from the ZINC             with filters of size 9, 10 and 11 respectively, while the de-
dataset [Irwin et al., 2012], and an additional 100,000 drawn          coder is comprised of 3 gated recurrent units (GRU) of 501
from the ChEMBL dataset [Gaulton et al., 2017]. These                  units [Kusner et al., 2017]. The structural validity of the gen-
two datasets are comprised of commercially available drug              erated compounds are checked using the open-source RDKit
molecules and have been used in related work using models              library [Landrum, ]. Examples of test set compounds that
like variational autoencoders (VAEs) [Gómez-Bombarelli et             have been encoded and decoded are shown in table 1.
al., 2018]. Molecules are represented in canonical SMILES                 As noted in the original Grammar VAE work [Kusner et
string format, and are further processed into 1) a one-hot char-       al., 2017] the vanilla VAE architectures encoding SMILES
acter encoding and 2) a set of context-free grammar (CFG)              strings directly generally produce a very low valid decode
rules. Grammar rules are obtained from the OpenSMILES                  rate on larger molecular datasets - just 17% using a condi-
specification [James and Dalke, 2016], which denotes how               tional VAE under their bayesian optimization search method-
the SMILES representation was formed based on the rules.               ology. By instead using a Grammar VAE to generate produc-
This context free grammar (CFG) consists of 76 production              tion rules of a grammar instead of SMILES strings directly a
rules, to which an additional seven are added, and a fur-              much higher rate of valid compounds was attained. However,
                 Figure 2: Visualization of the path in the latent space along which our algorithm is interpolating. On the
                 left are generated compounds not present in the training data. On the right are histograms of different
                 molecular properties, including synthetic accessibility, activity scores.


Algorithm 1 Interpolation with manifold traversal                       sume that the latent space is Euclidean and flattened out, and
  Input: latent space L, source and destination points s and d,         generally produce noisier results [Arvanitidis et al., 2017].
  points of interest R, path length m                                   In our algorithm, source and destination points are selected
  Parameter: k heuristics and associated edge weights W k               in the latent space; these can either be latent space encod-
  Output: a path of m equidistant points from s to d                    ing of single molecules, or cluster centroids of molecules la-
  for each reference point i in R do                                    beled with a desired property. Points of interest are the set
     Find n nearest neighbors of i → Ni                                 of known molecules with desired properties, for example all
     for each (i, j), where j ∈ Ni do                                   molecules used in treatment of a specific condition. The goal
        1. Calculate Jacobian matrix Jij of decoder output (dec)
                                                                        is to define a path from source to destination points in the
        w.r.t to latent space (L) ∂dec(L
                                    ∂Lij
                                         i j)
                                                                        latent space through regions of interest, factoring in any addi-
                                         k
        2. Obtain k heuristic distances Hij between the 2 points        tional user-specified heuristics such as synthetic accessibility,
        3. Determine path edge weight based on Jacobian and             binding activity, or drug-likeness that will augment generated
                           PN          P k         k
        heuristic: Pij ←      ij Jij +   k Hij · W   and store in a     molecules.
        k-d tree.                                                          Interpolation is performed by first calculating the Jacobian
  end for                                                               distances for all points of interest. This helps us to understand
  Apply a path finding algorithm such as Yen’s algorithm (with A*)      how much each latent space point differs from another based
  to determine a path from s to d on the k-d tree.                      on the representation learnt by the model, and understand the
  Segment the path using m equidistant points.
                                                                        stretching and rotational transformations of the local neigh-
  return P ← (s, p1 , .., pm−2 , d)
                                                                        borhood of each point with respect to other points.
                                                                           A k-dimensional tree is built using the resulting Jacobian
as OpenSMILES is a context-free instead of regular gram-                distances as edge weights between compounds in the points
mar it is still unable to model certain subtle characteristics of       of interest. Therefore, placing compounds with greater struc-
the SMILES grammar such as paired ring bonds; for example               tural similarity in closer proximity on the tree. A k-d tree is
the SMILES string ’c1ccccc1C2CCCC2’ would be decoded                    chosen since it divides the domain of search into half at each
as ’c1ccccc1C2CCCC’, incorrectly dropping the final paired              level. Hence search for a node in the tree can be done in log-
digit. By incorporating a self-attention layer in the Grammar           arithmic time and makes the data structure run time efficient.
VAE architecture this effect was mitigated, and increased the           Edges can also be weighted by user-specified domain heuris-
validity of decoded test set molecules from 61% to 70%.                 tics, adding a weighted cost to augment the paths produced to
                                                                        generate molecules more relevant for a specific target. These
2.3   Manifold Traversal of Latent Space                                heuristics are listed below:
Once a latent space has been generated we can apply our man-               • Fingerprint Similarity: A fingerprint is a series of bi-
ifold traversal algorithm to generate interpolative paths in the              nary digits (bits) that represent the presence or absence
latent space in order to decode molecules with desired prop-                  of particular substructures in the molecule. The simi-
erties. A common approach here is to use linear or spheri-                    larity can be tested using cosine or Tanimoto distance
cal interpolation [White, 2016], however both approaches as-                  metrics. The Tanimoto similarity takes into account the
Table 1: Input and corresponding decoded SMILES strings of test set compounds generated by the model. Tanimoto similarity is used to
measure the the structural similarity. An observation that can be made here is that small changes in the generated structure w.r.t to actual
structure might lead to disproportionate changes in the similarity.

 Actual Compound                                                  Generated compound                                                Similarity
 OC(=O)CSc1oc2ccccc2n1                                            OC(=O)CCS1cc2ccccs2n1                                             0.141
 CC(=O)C(=Cc1ccccc1)c2ccccc2                                      O.CCCN(Cc1ccccc1c2)cccnn2                                         0.171
 Cc1occc1C(=O)NCc2onc(n2)c3ccccc3                                 Nc1nnc1NC(=O)NCSc2nn(s2)c3ccccc3                                  0.221
 Cc1ccc(SCC(=O)NNC(=O)c2cncc(Br)c2)cc1                            Clc1ccc(SCC(=O)NC(=O)c2cncc(Br)c2cc1)CF                           0.373
 CCc1ccc(cc1)N(CC(=O)N2CCOCC2)S(=O)(=O)C                          CSc1ccc(cc1N(C(=O)Oc2CCCCC2S(=O)(=O)))ON                          0.235

Table 2: Samples of the compounds generated while interpolating on the latent space path as shown in Figure 2. These compounds are not
present in the train/test datasets, and we assign a potential target label based on labels of their nearest neighbours in the test set. SAS score
indicates if a molecule is difficult to synthesize. According to the Lipinski’s rule of five [Benet et al., 2016], the molecular mass for an active
drug should be less than 500 daltons. Activity range demonstrates the binding activity of a compound: the greater the range more active it is.

  Generated compound                                                Activity range          SAS           Molecular       Potential Label
                                                                                            Score         weight
  CS1(C2)CC(C(C3C(C))C)[C@]1n1C2n1C3C4CC4C Less than 5                                      6.186         299.89          DIABETES
  CCC1(C)C(C)CCCCS1C(C1(O([C@+2](C1))))S    Between 5 and 7                                 6.251         274.49          DIABETES
  CSCCCC2(C=C(O))S(C=O)C2C1[C@@](C1(C))CC Between 5 and 7                                   5.990         301.49          DIABETES
  CC1(N(S))NN1CCCNc1ncnc2CCn2c1CC           Greater than 7                                  5.233         299.89          LUNG CANCER
  CS1(C2)CC(C(C3C(C))C)[C@]1n1C2n1C3C4CC4Cl Less than 5                                     6.186         309.44          LUNG CANCER
  CCCC1CCC(S)N(C1)C(O)NCC                   Less than 5                                     4.355         232.93          DIABETES


     structural properties whereas cosine does not.                         decoded along the path of centroids generated just 3 com-
                                                                            pounds with valid structures. On the contrary, applying man-
  • Synthetic Accessibility (SA Score): A molecule syn-
                                                                            ifold traversal with fingerprint similarity and Yen’s algorithm
    thetic accessibility is a score which is between 1 (easy to
                                                                            as heuristic and perturbing the source and destination points
    produce) and 10 (very difficult to produce). This is cal-
                                                                            produced 4 different paths. These four paths generated a total
    culated based on fragment contributions and molecule
                                                                            of 156 valid, novel compounds along the interpolated mani-
    complexity. The absolute difference between two
                                                                            fold between the latent regions of diabetes and lung cancer
    molecules is taken into account.
                                                                            labeled molecules. Samples of these generated compounds
  • Drug-likeliness: This is a score which takes into account               can be seen in Table 2. Specific regions to mine within the
    if the molecule is ’drug-like’. This is evaluated using                 latent space can be found by plotting different paths, bound
    several parameters such as molecular weight, solubility                 them and finding the overlap region where compounds with
    in water or lipophilic efficiency. The absolute difference              the right structure and specific characteristics can be found.
    between two molecules is taken into account.
  • Binding activity: This demonstrates the potency to a tar-               3    Conclusions and Future Work
    get for a potential drug compound; less than 5 is consid-               In this work we presented a model-agnostic platform for per-
    ered inactive, 5-7 of intermediate activity, greater than 7             forming manifold traversal on latent spaces with user speci-
    active.                                                                 fied domain heuristics. This interpolation method allows us
   Yen’s algorithm [Yen, 1971] in combination with the A*                   to add more context and direction to search and discovery of
algorithm is then applied on the k-d tree to find the short-                molecules in the latent space. Methods for exploration of la-
est path from source to destination given the user constraints.             tent spaces generated from datasets of millions of molecules
Once the shortest path is found, we interpolate along this path             provide an extremely valuable tool for virtual drug screening,
equidistantly and decode the points on the latent space using               and an ability to facilitate rapid drug discovery.
the generator to generate compounds. Multiple paths can be                     Some avenues for future work in this domain include: im-
found by taking into account the shortest path and either per-              plementation of alternative models to produce latent-spaces
turbing it or by changing the number of interpolation points                of various characteristics; more sophisticated methods for
between the source and the destination, and intuitively this                curve fitting in high dimensional spaces such as Bézier curves
increases the overall number of novel generated compounds.                  and Gaussian regression; alternative search methods such as
                                                                            using evolutionary and genetic algorithms on the latent space.
   Manifold traversal is inherently more useful than linear or
                                                                            Future work would also focus on implementing latent space
spherical interpolation as it gives users greater flexibility in
                                                                            evaluation metrics using this interpolation method to under-
path exploration under various conditions, and empirically
                                                                            stand the underlying aspects of these spaces.
demonstrates a much higher rate of valid decoded molecules.
For example, when considering the diabetes and lung can-
cer centroids; linear interpolation with 100 equidistant points
References                                                       [Lim et al., 2018] Jaechang Lim, Seongok Ryu, Jin Woo
[Anderson et al., 1987] E.                                          Kim, and Woo Youn Kim. Molecular generative model
                             Anderson,     G.D.     Veith,
                                                                    based on conditional variational autoencoder for de novo
  D. Weininger, and Minn.) Environmental Research
                                                                    molecular design. Journal of cheminformatics, 10(1):1–9,
  Laboratory (Duluth.      SMILES, a Line Notation and
                                                                    2018.
  Computerized Interpreter for Chemical Structures. Envi-
  ronmental research brief. U.S. Environmental Protection        [Maziarka et al., 2020] Łukasz Maziarka, Agnieszka Pocha,
  Agency, Environmental Research Laboratory, 1987.                  Jan Kaczmarczyk, Krzysztof Rataj, Tomasz Danel, and
                                                                    Michał Warchoł. Mol-cyclegan: a generative model for
[Arvanitidis et al., 2017] Georgios Arvanitidis, Lars Kai           molecular optimization. Journal of Cheminformatics,
  Hansen, and Søren Hauberg. Latent space oddity: on                12(1):1–18, 2020.
  the curvature of deep generative models. arXiv preprint
  arXiv:1710.11379, 2017.                                        [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki
                                                                    Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
[Benet et al., 2016] Leslie Z Benet, Chelsea M Hosey, Oleg          Łukasz Kaiser, and Illia Polosukhin. Attention is all you
  Ursu, and Tudor I Oprea. Bddcs, the rule of 5 and druga-          need. In Advances in neural information processing sys-
  bility. Advanced drug delivery reviews, 101:89–98, 2016.          tems, pages 5998–6008, 2017.
[De Cao and Kipf, 2018] Nicola De Cao and Thomas Kipf.           [White, 2016] Tom White. Sampling generative networks.
  Molgan: An implicit generative model for small molecular          arXiv preprint arXiv:1609.04468, 2016.
  graphs. arXiv preprint arXiv:1805.11973, 2018.                 [Yen, 1971] Jin Y Yen. Finding the k shortest loopless
[Fleming, 2018] Nic Fleming. How artificial intelligence is         paths in a network. management Science, 17(11):712–716,
   changing drug discovery. Nature, 557(1):S55–S57, 2018.           1971.
[Gaulton et al., 2017] Anna Gaulton, Anne Hersey, Michał
  Nowotka, A Patrı́cia Bento, Jon Chambers, David
  Mendez, Prudence Mutowo, Francis Atkinson, Louisa J
  Bellis, Elena Cibrián-Uhalte, et al. The chembl database in
  2017. Nucleic acids research, 45(D1):D945–D954, 2017.
[Gómez-Bombarelli et al., 2018] Rafael           Gómez-
  Bombarelli, Jennifer N Wei, David Duvenaud,
  José Miguel Hernández-Lobato, Benjamı́n Sánchez-
  Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre,
  Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-
  Guzik. Automatic chemical design using a data-driven
  continuous representation of molecules. ACS central
  science, 4(2):268–276, 2018.
[Irwin et al., 2012] John J Irwin,      Teague Sterling,
   Michael M Mysinger, Erin S Bolstad, and Ryan G
   Coleman. Zinc: a free tool to discover chemistry for
   biology. Journal of chemical information and modeling,
   52(7):1757–1768, 2012.
[James and Dalke, 2016] Vander-meersch T James, Craig A
   and A. Dalke. Opensmiles specification, 2016.
[Kang and Cho, 2018] Seokho Kang and Kyunghyun Cho.
  Conditional molecular design with deep generative mod-
  els. Journal of chemical information and modeling,
  59(1):43–52, 2018.
[Kingma and Welling, 2013] Diederik P Kingma and Max
  Welling. Auto-encoding variational bayes. arXiv preprint
  arXiv:1312.6114, 2013.
[Kusner et al., 2017] Matt J Kusner, Brooks Paige, and
  José Miguel Hernández-Lobato. Grammar variational au-
  toencoder. In Proceedings of the 34th International Con-
  ference on Machine Learning-Volume 70, pages 1945–
  1954. JMLR. org, 2017.
[Landrum, ] Greg Landrum. Rdkit: Open-source cheminfor-
  matics.