Continuous Representation of Molecules Using Graph Variational Autoencoder

Mohammadamin Tavakoli, Pierre Baldi
Department of Computer Science, University of California, Irvine
{mohamadt, pfbaldi}@ics.uci.edu

Abstract

In order to represent molecules continuously, we propose a generative model in the form of a VAE that operates on the 2D graph structure of molecules. A side predictor is employed to prune the latent space and to help the decoder generate meaningful adjacency tensors of molecules. Beyond its potential applicability to drug design and property prediction, we show the superior performance of this technique in comparison to similar methods based on the SMILES representation of molecules with RNN-based encoders and decoders.

Introduction

Using machine learning to predict the properties of molecular structures is a challenging problem [7, 3]. While the governing equations (e.g., the Schrödinger equation) are difficult and computationally expensive to solve, the fact that an underlying model exists is appealing for machine learning techniques. However, the problem is difficult from a technical point of view: the space of molecules is discrete and non-numerical. Thus, "how to best represent molecules and atoms for machine learning problems?" remains an open question.

[Figure 1: Outline of the model. The inputs are the node-feature matrix and the adjacency tensor, while the model outputs only the adjacency tensor. A GCN encoder produces the mean μ and covariance σ of the latent distribution, a graph deconvolution (GDCN) decoder reconstructs the adjacency tensor, and a pooled latent vector feeds the side predictor and its side target.]

Despite the numerous existing ways to represent molecules, such as the methods introduced in [18, 1], all of these representations suffer from a few shortcomings: 1) discrete representation, 2) lengthy representation, 3) non-injective mapping, and 4) non-machine-readable representation. Here, we propose a new method that borrows its main idea from [5] and [12] and overcomes all of the aforementioned shortcomings. Our method, which takes the graphical structure of the molecule as input, consists of a variational framework with a side predictor that helps prune the structure of the latent space. An inner-product decoder then transfers samples of the latent space into meaningful adjacency tensors. To compare against the main benchmark, a text-based encoding of molecules [9], we performed two experiments on the QM9 dataset [16, 15] and on ZINC [11]. Both experiments show the success of this method. Although this work presents preliminary results of the Graph VAE, further experiments and comparisons are left to future work.

Method

Molecules and Graphs. A molecule can be represented by an undirected graph G = (V, E, R), with nodes (atoms) v_i ∈ V and labeled edges (bonds) (v_i, r, v_j) ∈ E, where r ∈ R is the edge (bond) type. Since we focus on small molecules with four bond types, |R| = 4. An n × d node-feature matrix H carries additional information about each node. These two tensors together represent a molecular structure.
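To make this encoding concrete, the following is a minimal sketch (our illustration, not the authors' code) of how a molecule can be turned into the node-feature matrix H and the relational adjacency tensor A with RDKit; the atom vocabulary and the bond-type ordering are our assumptions, since the paper does not specify them.

```python
import numpy as np
from rdkit import Chem

# Assumed ordering of the |R| = 4 edge types; illustrative only.
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
              Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]
# Hypothetical atom vocabulary for the node features (so d = 5 here).
ATOMS = ["C", "N", "O", "F", "H"]

def mol_to_tensors(smiles: str):
    """Return (H, A): an n x d node-feature matrix and an
    |R| x n x n adjacency tensor for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    H = np.zeros((n, len(ATOMS)))          # one-hot atom types
    for atom in mol.GetAtoms():
        H[atom.GetIdx(), ATOMS.index(atom.GetSymbol())] = 1.0
    A = np.zeros((len(BOND_TYPES), n, n))  # one slice per bond type
    for bond in mol.GetBonds():
        r = BOND_TYPES.index(bond.GetBondType())
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        A[r, i, j] = A[r, j, i] = 1.0      # undirected graph
    return H, A

H, A = mol_to_tensors("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
```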
Variational Autoencoders. To help ensure that points in the latent space correspond to valid, realistic molecules, and to minimize the dead areas of the latent space, we chose to use a variational autoencoder (VAE). To further ensure that the outputs of the decoder correspond to valid molecules, we employed the open-source cheminformatics suite RDKit to validate the chemical structures of the output molecules in terms of atomic valence; all invalid outputs are discarded. It is necessary to mention that the ordering of the nodes is assumed to be unchanged.

VAE and Side Prediction. To better learn the graph structure of the molecules, the encoder part of the VAE consists of GCN layers. The same method as [17] has been employed to perform the relational update, which can be formulated as

$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{j \in N_i^r} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big)$$

where N_i^r denotes the set of nodes connected to node i through edge type r ∈ R. Since we are focusing on small molecules, we applied three layers of GCN in our encoder model to gather information from the 3-hop neighborhood of each atom. The encoder consists of two three-layer GCNs, one for the mean and one for the covariance, which share the filters of their first two layers. The encoding and sampling scheme can be formulated as follows:

$$q(Z|H, A) = \prod_{i=1}^{N} q_i(z_i|H, A), \qquad q_i(z_i|H, A) = \mathcal{N}\big(z_i \mid \mathrm{GCN}_\mu(H, A), \mathrm{GCN}_\sigma(H, A)\big)$$

GCN_μ, and similarly GCN_σ, is given by GCN(H, A) = Â σ(Â σ(Â H W_0) W_1) W_2, where Â is the normalized adjacency tensor, W_i is the filter parameter of each layer, and σ is the activation function [2]. Finally, as suggested in [12], we use the simplest form of decoder, which can be seen as a graph deconvolution network: the output of the decoder is simply the inner product between latent variables,

$$p(A|Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij}|z_i, z_j), \qquad p(A_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^{\top} z_j)$$

For the side prediction part, we employ a simple regression model in the form of a multilayer perceptron (MLP) as the network that predicts properties from the latent-space representation. The input of the side predictor is a vector obtained through a pooling mechanism over the latent representation:

$$G(H^{(L)}) = \mathrm{softmax}\Big(\sum_{i=1}^{N} h_i^{(L)} W_p\Big)$$

where W_p is the pooling weight matrix and H^{(L)} is the output of GCN_μ.
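For concreteness, here is a minimal PyTorch sketch (our illustration, not the authors' released code) of the relational GCN layer of [17] and the inner-product decoder of [12]; the weight initialization, the dense tensor layout, and the use of ELU activations on the first two layers only are our assumptions.

```python
import torch
import torch.nn as nn

class RelationalGCNLayer(nn.Module):
    """One relational update: h_i' = act(sum_r sum_{j in N_i^r} W_r h_j + W_0 h_i),
    computed densely as sum_r A_hat[r] @ H @ W_r + H @ W_0."""
    def __init__(self, n_rel, d_in, d_out, act):
        super().__init__()
        self.w_rel = nn.Parameter(0.01 * torch.randn(n_rel, d_in, d_out))
        self.w_self = nn.Parameter(0.01 * torch.randn(d_in, d_out))
        self.act = act

    def forward(self, H, A_hat):
        # H: (N, d_in); A_hat: (n_rel, N, N) normalized adjacency tensor
        msg = torch.einsum("rij,jd,rde->ie", A_hat, H, self.w_rel)
        return self.act(msg + H @ self.w_self)

class GCNEncoder(nn.Module):
    """Three-layer relational GCN; one such network outputs the per-node mean,
    and a twin network (sharing the first two layers) outputs the covariance."""
    def __init__(self, n_rel, d_in, dims=(32, 32, 16)):
        super().__init__()
        sizes = (d_in,) + dims
        self.layers = nn.ModuleList(
            RelationalGCNLayer(n_rel, sizes[i], sizes[i + 1],
                               act=nn.ELU() if i < 2 else nn.Identity())
            for i in range(3))

    def forward(self, H, A_hat):
        for layer in self.layers:
            H = layer(H, A_hat)
        return H  # (N, 16): one latent parameter vector per atom

def decode(Z):
    """Inner-product decoder: p(A_ij = 1 | z_i, z_j) = sigmoid(z_i . z_j)."""
    return torch.sigmoid(Z @ Z.T)
```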
Finally, the autoencoder is trained jointly on the reconstruction task and the property prediction task; the joint loss function is the summation of the two losses, the ELBO of the reconstruction and the negative log-likelihood of the side network (a mean squared error):

$$\mathcal{L} = \mathbb{E}_{q(Z|H,A)}\big[\log p(A|Z)\big] - \mathrm{KL}\big(q(Z|H, A)\,\|\,p(Z)\big) + \mathrm{MSE}(\text{side network})$$
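As an illustration only (our sketch, with sign conventions chosen so that the result is minimized), the joint objective can be computed as follows; `A_logits` is assumed to be the pre-sigmoid inner product Z Zᵀ from the decoder above, and the unweighted sum of the terms is our assumption since the paper simply adds the two losses.

```python
import torch
import torch.nn.functional as F

def joint_loss(A_true, A_logits, mu, log_sigma, y_true, y_pred):
    """Negative ELBO for the adjacency reconstruction plus the side-network MSE.
    A_true, A_logits: (N, N); mu, log_sigma: (N, 16); y_true, y_pred: scalars."""
    # -E_q[log p(A|Z)] as a binary cross-entropy over all entries A_ij
    recon = F.binary_cross_entropy_with_logits(A_logits, A_true)
    # KL(q(Z|H,A) || N(0, I)) in closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp())
    side = F.mse_loss(y_pred, y_true)  # side-predictor regression loss
    return recon + kl + side
```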
Experiments

We performed two experiments to show the usefulness of the continuous representation. In the first experiment, we focus on property prediction and the generation of valid molecules. In the second experiment, we use the continuous representation to propose a new metric for measuring molecular similarity.

Property Prediction

Using a subset of the QM9 dataset [15] as the training set, we extract 48,000 molecules covering a broad range of molecules. Each molecule in the training set is chosen to have up to 20 atoms. The training objective of the side predictor was set to be one of Solubility, Druglikeliness, or Synthesizability. We then employ the continuous representation learned with each network to predict the other two unseen properties. The performance of each model, plus the percentage of validly generated molecules, is summarized in Table 1. In order to check the validity of an outcome, we only check the validity of the atomic valences. As shown in Table 1, the accuracy of each property is comparable to the state-of-the-art property predictions mentioned in [8]. Although the Graph VAE does not outperform the predictions based on [8], it shows that using a property as a heuristic to prune the latent space can help with predicting other molecular properties.

Table 1: Three models trained with three different side properties. Using Druglikeliness as the side property helps most in predicting Solubility and Synthesizability.

Side property      Valid outcome   Sol     Synt    Druglikeliness
Solubility         75.3            97.03   88.7    84.2
Synthesizability   73.0            89.8    98.21   86.3
Druglikeliness     74.6            91.0    90.7    95.11

Molecular Similarity Measure

Numerous similarity or distance measures have been widely used to calculate the similarity or dissimilarity between two samples. Since these metrics focus more on the 2-dimensional representation rather than the 3-dimensional structure, our model, as a "2D structure-aware representation," is an accurate metric for similarity measurement. The normalized Euclidean distance between the latent representations of two molecules after the pooling operation is the metric we define to capture similarity. Here we compare three well-known similarity measures with our technique and also with the method introduced in [9]. That method, which uses the SMILES representation of molecules as its input, employs a VAE with a side predictor; both the encoder and decoder parts of its VAE are based on RNNs and a sequence-to-sequence model. Although all the graphical information of the molecule is encoded within the SMILES representation, inferring the graphical structure (e.g., the adjacency tensor) from a SMILES string is an exhausting process based on several rules. Despite the numerous techniques built upon the SMILES representation of molecules [6, 10, 4, 14, 13], it has been shown that it is more efficient to take advantage of the graph structure and employ GCNs to process molecular structures.

Here, we chose Aspirin as a sample drug and compare its similarity to four different drugs (Figure 2) using four different similarity measures. We compare the performance of our technique with [9], which uses a similar approach but operates on the text representation of molecules. Our experiment shows that the graph-based hidden representation carries more information than text alone. Table 2 summarizes the results of the similarity measure experiment. As shown in Table 2, our metric aligns well with the other well-known metrics, which is further evidence of the applicability of our model.

[Figure 2: Drugs compared to Aspirin: Nicotine, Ecstasy, Amphetamine, and Caffeine.]

Table 2: Similarity measures between Aspirin and four different drugs. Using the Graph VAE as a new metric shows consistency with the other metrics. The GVAE is trained with Solubility as the side property.

Metric            Amphetamine   Ecstasy (MDMA)   Nicotine   Caffeine
Tanimoto          0.398         0.324            0.229      0.258
Dice              0.569         0.490            0.373      0.410
Cosine            0.607         0.490            0.374      0.434
Graph VAE         0.363         0.199            0.147      0.176
SMILES VAE [9]    0.724         0.489            0.340      0.321

Experiment Details

The GVAE consists of two GCNs for the encoder, a pooling mechanism, and a multilayer perceptron for the side prediction. Both GCNs are three-layer networks with filter matrices W_0, W_1, and W_2 of sizes 32×32, 32×32, and 32×16, respectively. The pooling weight matrix W_p is of size 1×64 and outputs a vector of length 64 to represent the whole molecule. A two-layer MLP with 32 and 1 hidden units is employed to perform the regression task.

In Table 2, we use our own implementation of the SMILES VAE. Both the GVAE and the SMILES VAE are trained using a dataset of 70,000 molecules randomly selected from ZINC.

In Table 2, all measures except the continuous representations are calculated with the same fingerprinting algorithm. It identifies and hashes topological paths (e.g., along bonds) in the molecule and then uses them to set bits in a fingerprint of length 2048. The parameters used by the algorithm are: minimum path size: 1 bond; maximum path size: 7 bonds; number of bits set per hash: 2; target on-bit density: 0.3.
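This description matches RDKit's path-based `RDKFingerprint`; as a minimal sketch (our illustration, not the authors' code), the baseline similarities could be computed as follows. The latent metric is included for comparison under the assumption of pooled length-64 vectors; the paper does not say exactly how the "normalized Euclidean distance" is scaled, so unit-normalization is our reading.

```python
import numpy as np
from rdkit import Chem, DataStructs

def fingerprint(smiles: str):
    """Path-based fingerprint with the parameters listed above."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.RDKFingerprint(mol, minPath=1, maxPath=7, fpSize=2048,
                               nBitsPerHash=2, tgtDensity=0.3)

aspirin = fingerprint("CC(=O)Oc1ccccc1C(=O)O")
caffeine = fingerprint("Cn1cnc2c1c(=O)n(C)c(=O)n2C")
print(DataStructs.TanimotoSimilarity(aspirin, caffeine))  # Tanimoto row of Table 2
print(DataStructs.DiceSimilarity(aspirin, caffeine))      # Dice row
print(DataStructs.CosineSimilarity(aspirin, caffeine))    # Cosine row

def latent_distance(g1: np.ndarray, g2: np.ndarray) -> float:
    """Graph VAE metric: Euclidean distance between the pooled length-64
    latent vectors of two molecules, unit-normalized first (our assumption)."""
    g1, g2 = g1 / np.linalg.norm(g1), g2 / np.linalg.norm(g2)
    return float(np.linalg.norm(g1 - g2))
```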
Conclusion

We proposed a generative model through which we can find continuous representations of molecules. As shown in the Experiments section, this technique can be used for different chemoinformatics tasks such as drug design, drug discovery, and property prediction. As future work, one can consider attention-based graph convolutions and more complicated decoders.

References

[1] E. J. Bjerrum. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076, 2017.
[2] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[3] C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, and K. F. Jensen. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science, 10(2):370–377, 2019.
[4] A. Dalke. DeepSMILES: An adaptation of SMILES for use in machine-learning of chemical structures. 2018.
[5] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2224–2232. Curran Associates, Inc., 2015.
[6] D. C. Elton, Z. Boukouvalas, M. D. Fuge, and P. W. Chung. Deep learning for molecular design - a review of the state of the art. Molecular Systems Design & Engineering, 2019.
[7] D. Fooshee, A. Mood, E. Gutman, M. Tavakoli, G. Urban, F. Liu, N. Huynh, D. Van Vranken, and P. Baldi. Deep learning for chemical reaction prediction. Molecular Systems Design & Engineering, 2018.
[8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.
[9] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
[10] M. Hirohara, Y. Saito, Y. Koda, K. Sato, and Y. Sakakibara. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics, 19(19):526, 2018.
[11] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
[12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[13] M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik. SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry. arXiv preprint arXiv:1905.13741, 2019.
[14] S. Kwon and S. Yoon. DeepCCI: End-to-end deep learning for chemical-chemical interaction prediction. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 203–212. ACM, 2017.
[15] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014.
[16] L. Ruddigkeit, R. Van Deursen, L. C. Blum, and J.-L. Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012.
[17] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
[18] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.