Continuous Representation of Molecules Using Graph Variational Autoencoder

Mohammadamin Tavakoli, Pierre Baldi
Department of Computer Science, University of California, Irvine
{mohamadt, pfbaldi}@ics.uci.edu

Abstract

In order to represent molecules continuously, we propose a generative model in the form of a VAE that operates on the 2D graph structure of molecules. A side predictor is employed to prune the latent space and to help the decoder generate meaningful adjacency tensors of molecules. Beyond its potential applicability to drug design and property prediction, we show the superior performance of this technique in comparison to similar methods based on the SMILES representation of molecules with RNN-based encoders and decoders.

Introduction

Using machine learning to predict the properties of molecular structures is a challenging problem [7, 3]. While the governing equations (e.g., the Schrödinger equation) are difficult and computationally expensive to solve, the fact that an underlying model exists is appealing for machine learning techniques. However, the problem is difficult from a technical point of view: the space of molecules is discrete and non-numerical. Thus, "how to best represent molecules and atoms for machine learning problems?" remains an open question.

[Figure 1: Outline of the model. The inputs are the node-feature matrix and the adjacency tensor, while the model outputs only the adjacency tensor. A GCN encoder produces the mean μ and covariance σ of the latent distribution, a graph deconvolution (GDCN) decoder reconstructs the adjacency tensor, and a pooled latent vector feeds the side predictor and its side target.]

Despite the numerous existing ways to represent molecules, such as the methods introduced in [18, 1], all of these representations suffer from a few shortcomings: 1) discrete representation, 2) lengthy representation, 3) non-injective mapping, and 4) non-machine-readable representation. Here, we propose a new method that borrows its main idea from [5] and [12] and overcomes all of the aforementioned shortcomings. Our method, which takes the graphical structure of the molecule as input, consists of a variational framework with a side predictor that helps prune the structure of the latent space. An inner-product decoder then transfers samples of the latent space into meaningful adjacency tensors. To compare against the main benchmark, a text-based encoding of molecules [9], we performed two experiments on the QM9 dataset [16, 15] and on ZINC [11]. Both experiments show the success of this method. Although this work presents preliminary results of the Graph VAE, further experiments and comparisons are left to future work.

Method

Molecules and Graphs. A molecule can be represented by an undirected graph G = (V, E, R), with nodes (atoms) v_i ∈ V and labeled edges (bonds) (v_i, r, v_j) ∈ E, where r ∈ R is the edge (bond) type. Since we focus on small molecules with four bond types, |R| = 4. An n × d node-feature matrix H carries additional information about each node. These two tensors together represent a molecular structure.
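To make this encoding concrete, the following is a minimal sketch (our illustration, not the authors' code) of how a molecule can be turned into the node-feature matrix H and the relational adjacency tensor A with RDKit; the atom vocabulary and the bond-type ordering are our assumptions, since the paper does not specify them.

```python
import numpy as np
from rdkit import Chem

# Assumed ordering of the |R| = 4 edge types; illustrative only.
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
              Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]
# Hypothetical atom vocabulary for the node features (so d = 5 here).
ATOMS = ["C", "N", "O", "F", "H"]

def mol_to_tensors(smiles: str):
    """Return (H, A): an n x d node-feature matrix and an
    |R| x n x n adjacency tensor for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    H = np.zeros((n, len(ATOMS)))          # one-hot atom types
    for atom in mol.GetAtoms():
        H[atom.GetIdx(), ATOMS.index(atom.GetSymbol())] = 1.0
    A = np.zeros((len(BOND_TYPES), n, n))  # one slice per bond type
    for bond in mol.GetBonds():
        r = BOND_TYPES.index(bond.GetBondType())
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        A[r, i, j] = A[r, j, i] = 1.0      # undirected graph
    return H, A

H, A = mol_to_tensors("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
```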
Variational Autoencoders. To help ensure that points in the latent space correspond to valid, realistic molecules, and to minimize the dead areas of the latent space, we chose to use a variational autoencoder (VAE). To further ensure that the outputs of the decoder correspond to valid molecules, we employed the open-source cheminformatics suite RDKit to validate the chemical structures of the output molecules in terms of atomic valence; all invalid outputs are discarded. It is necessary to mention that the ordering of the nodes is assumed to be unchanged.

VAE and Side Prediction. To better learn the graph structure of the molecules, the encoder part of the VAE consists of GCN layers. The same method as [17] has been employed to perform the relational update, which can be formulated as

$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{j \in N_i^r} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big)$$

where N_i^r denotes the set of nodes connected to node i through edge type r ∈ R. Since we are focusing on small molecules, we applied three layers of GCN in our encoder model to gather information from the 3-hop neighborhood of each atom. The encoder consists of two three-layer GCNs, one for the mean and one for the covariance, which share the filters of their first two layers. The encoding and sampling scheme can be formulated as follows:

$$q(Z|H, A) = \prod_{i=1}^{N} q_i(z_i|H, A), \qquad q_i(z_i|H, A) = \mathcal{N}\big(z_i \mid \mathrm{GCN}_\mu(H, A), \mathrm{GCN}_\sigma(H, A)\big)$$

GCN_μ, and similarly GCN_σ, is given by GCN(H, A) = Â σ(Â σ(Â H W_0) W_1) W_2, where Â is the normalized adjacency tensor, W_i is the filter parameter of each layer, and σ is the activation function [2]. Finally, as suggested in [12], we use the simplest form of decoder, which can be seen as a graph deconvolution network: the output of the decoder is simply the inner product between latent variables,

$$p(A|Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij}|z_i, z_j), \qquad p(A_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^{\top} z_j)$$

For the side prediction part, we employ a simple regression model in the form of a multilayer perceptron (MLP) as the network that predicts properties from the latent-space representation. The input of the side predictor is a vector obtained through a pooling mechanism over the latent representation:

$$G(H^{(L)}) = \mathrm{softmax}\Big(\sum_{i=1}^{N} h_i^{(L)} W_p\Big)$$

where W_p is the pooling weight matrix and H^{(L)} is the output of GCN_μ.
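For concreteness, here is a minimal PyTorch sketch (our illustration, not the authors' released code) of the relational GCN layer of [17] and the inner-product decoder of [12]; the weight initialization, the dense tensor layout, and the use of ELU activations on the first two layers only are our assumptions.

```python
import torch
import torch.nn as nn

class RelationalGCNLayer(nn.Module):
    """One relational update: h_i' = act(sum_r sum_{j in N_i^r} W_r h_j + W_0 h_i),
    computed densely as sum_r A_hat[r] @ H @ W_r + H @ W_0."""
    def __init__(self, n_rel, d_in, d_out, act):
        super().__init__()
        self.w_rel = nn.Parameter(0.01 * torch.randn(n_rel, d_in, d_out))
        self.w_self = nn.Parameter(0.01 * torch.randn(d_in, d_out))
        self.act = act

    def forward(self, H, A_hat):
        # H: (N, d_in); A_hat: (n_rel, N, N) normalized adjacency tensor
        msg = torch.einsum("rij,jd,rde->ie", A_hat, H, self.w_rel)
        return self.act(msg + H @ self.w_self)

class GCNEncoder(nn.Module):
    """Three-layer relational GCN; one such network outputs the per-node mean,
    and a twin network (sharing the first two layers) outputs the covariance."""
    def __init__(self, n_rel, d_in, dims=(32, 32, 16)):
        super().__init__()
        sizes = (d_in,) + dims
        self.layers = nn.ModuleList(
            RelationalGCNLayer(n_rel, sizes[i], sizes[i + 1],
                               act=nn.ELU() if i < 2 else nn.Identity())
            for i in range(3))

    def forward(self, H, A_hat):
        for layer in self.layers:
            H = layer(H, A_hat)
        return H  # (N, 16): one latent parameter vector per atom

def decode(Z):
    """Inner-product decoder: p(A_ij = 1 | z_i, z_j) = sigmoid(z_i . z_j)."""
    return torch.sigmoid(Z @ Z.T)
```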
Finally, the autoencoder is trained jointly on the reconstruction task and the property prediction task; the joint loss function is the summation of the two losses, the ELBO of the reconstruction and the negative log-likelihood of the side network (a mean squared error):

$$\mathcal{L} = \mathbb{E}_{q(Z|H,A)}\big[\log p(A|Z)\big] - \mathrm{KL}\big(q(Z|H, A)\,\|\,p(Z)\big) + \mathrm{MSE}(\text{side network})$$
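As an illustration only (our sketch, with sign conventions chosen so that the result is minimized), the joint objective can be computed as follows; `A_logits` is assumed to be the pre-sigmoid inner product Z Zᵀ from the decoder above, and the unweighted sum of the terms is our assumption since the paper simply adds the two losses.

```python
import torch
import torch.nn.functional as F

def joint_loss(A_true, A_logits, mu, log_sigma, y_true, y_pred):
    """Negative ELBO for the adjacency reconstruction plus the side-network MSE.
    A_true, A_logits: (N, N); mu, log_sigma: (N, 16); y_true, y_pred: scalars."""
    # -E_q[log p(A|Z)] as a binary cross-entropy over all entries A_ij
    recon = F.binary_cross_entropy_with_logits(A_logits, A_true)
    # KL(q(Z|H,A) || N(0, I)) in closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp())
    side = F.mse_loss(y_pred, y_true)  # side-predictor regression loss
    return recon + kl + side
```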
Experiments

We performed two experiments to show the usefulness of the continuous representation. In the first experiment, we focus on property prediction and the generation of valid molecules. In the second experiment, we use the continuous representation to propose a new metric for measuring molecular similarity.

Property Prediction

Using a subset of the QM9 dataset [15] as the training set, we extract 48,000 molecules covering a broad range of molecules. Each molecule in the training set is chosen to have up to 20 atoms. The training objective of the side predictor was set to be one of Solubility, Druglikeliness, or Synthesizability. We then employ the continuous representation learned with each network to predict the other two unseen properties. The performance of each model, plus the percentage of validly generated molecules, is summarized in Table 1. In order to check the validity of an outcome, we only check the validity of the atomic valences. As shown in Table 1, the accuracy of each property is comparable to the state-of-the-art property predictions mentioned in [8]. Although the Graph VAE does not outperform the predictions based on [8], it shows that using a property as a heuristic to prune the latent space can help with predicting other molecular properties.

Table 1: Three models trained with three different side properties. Using Druglikeliness as the side property helps most in predicting Solubility and Synthesizability.

Side property      Valid outcome   Sol     Synt    Druglikeliness
Solubility         75.3            97.03   88.7    84.2
Synthesizability   73.0            89.8    98.21   86.3
Druglikeliness     74.6            91.0    90.7    95.11

Molecular Similarity Measure

Numerous similarity or distance measures have been widely used to calculate the similarity or dissimilarity between two samples. Since these metrics focus more on the 2-dimensional representation rather than the 3-dimensional structure, our model, as a "2D structure-aware representation," is an accurate metric for similarity measurement. The normalized Euclidean distance between the latent representations of two molecules after the pooling operation is the metric we define to capture similarity. Here we compare three well-known similarity measures with our technique and also with the method introduced in [9]. That method, which uses the SMILES representation of molecules as its input, employs a VAE with a side predictor; both the encoder and decoder parts of its VAE are based on RNNs and a sequence-to-sequence model. Although all the graphical information of the molecule is encoded within the SMILES representation, inferring the graphical structure (e.g., the adjacency tensor) from a SMILES string is an exhausting process based on several rules. Despite the numerous techniques built upon the SMILES representation of molecules [6, 10, 4, 14, 13], it has been shown that it is more efficient to take advantage of the graph structure and employ GCNs to process molecular structures.

Here, we chose Aspirin as a sample drug and compare its similarity to four different drugs (Figure 2) using four different similarity measures. We compare the performance of our technique with [9], which uses a similar approach but operates on the text representation of molecules. Our experiment shows that the graph-based hidden representation carries more information than text alone. Table 2 summarizes the results of the similarity measure experiment. As shown in Table 2, our metric aligns well with the other well-known metrics, which is further evidence of the applicability of our model.

[Figure 2: Drugs compared to Aspirin: Nicotine, Ecstasy, Amphetamine, and Caffeine.]

Table 2: Similarity measures between Aspirin and four different drugs. Using the Graph VAE as a new metric shows consistency with the other metrics. The GVAE is trained with Solubility as the side property.

Metric            Amphetamine   Ecstasy (MDMA)   Nicotine   Caffeine
Tanimoto          0.398         0.324            0.229      0.258
Dice              0.569         0.490            0.373      0.410
Cosine            0.607         0.490            0.374      0.434
Graph VAE         0.363         0.199            0.147      0.176
SMILES VAE [9]    0.724         0.489            0.340      0.321

Experiment Details

The GVAE consists of two GCNs for the encoder, a pooling mechanism, and a multilayer perceptron for the side prediction. Both GCNs are three-layer networks with filter matrices W_0, W_1, and W_2 of sizes 32×32, 32×32, and 32×16, respectively. The pooling weight matrix W_p is of size 1×64 and outputs a vector of length 64 to represent the whole molecule. A two-layer MLP with 32 and 1 hidden units is employed to perform the regression task.

In Table 2, we use our own implementation of the SMILES VAE. Both the GVAE and the SMILES VAE are trained using a dataset of 70,000 molecules randomly selected from ZINC.

In Table 2, all measures except the continuous representations are calculated with the same fingerprinting algorithm. It identifies and hashes topological paths (e.g., along bonds) in the molecule and then uses them to set bits in a fingerprint of length 2048. The parameters used by the algorithm are: minimum path size: 1 bond; maximum path size: 7 bonds; number of bits set per hash: 2; target on-bit density: 0.3.
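This description matches RDKit's path-based `RDKFingerprint`; as a minimal sketch (our illustration, not the authors' code), the baseline similarities could be computed as follows. The latent metric is included for comparison under the assumption of pooled length-64 vectors; the paper does not say exactly how the "normalized Euclidean distance" is scaled, so unit-normalization is our reading.

```python
import numpy as np
from rdkit import Chem, DataStructs

def fingerprint(smiles: str):
    """Path-based fingerprint with the parameters listed above."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.RDKFingerprint(mol, minPath=1, maxPath=7, fpSize=2048,
                               nBitsPerHash=2, tgtDensity=0.3)

aspirin = fingerprint("CC(=O)Oc1ccccc1C(=O)O")
caffeine = fingerprint("Cn1cnc2c1c(=O)n(C)c(=O)n2C")
print(DataStructs.TanimotoSimilarity(aspirin, caffeine))  # Tanimoto row of Table 2
print(DataStructs.DiceSimilarity(aspirin, caffeine))      # Dice row
print(DataStructs.CosineSimilarity(aspirin, caffeine))    # Cosine row

def latent_distance(g1: np.ndarray, g2: np.ndarray) -> float:
    """Graph VAE metric: Euclidean distance between the pooled length-64
    latent vectors of two molecules, unit-normalized first (our assumption)."""
    g1, g2 = g1 / np.linalg.norm(g1), g2 / np.linalg.norm(g2)
    return float(np.linalg.norm(g1 - g2))
```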
Conclusion

We proposed a generative model through which we can find continuous representations of molecules. As shown in the Experiments section, this technique can be used for different chemoinformatics tasks such as drug design, drug discovery, and property prediction. As future work, one can consider attention-based graph convolutions and more complicated decoders.

References

[1] E. J. Bjerrum. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076, 2017.
[2] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[3] C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, and K. F. Jensen. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science, 10(2):370–377, 2019.
[4] A. Dalke. DeepSMILES: An adaptation of SMILES for use in machine-learning of chemical structures. 2018.
[5] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2224–2232. Curran Associates, Inc., 2015.
[6] D. C. Elton, Z. Boukouvalas, M. D. Fuge, and P. W. Chung. Deep learning for molecular design - a review of the state of the art. Molecular Systems Design & Engineering, 2019.
[7] D. Fooshee, A. Mood, E. Gutman, M. Tavakoli, G. Urban, F. Liu, N. Huynh, D. Van Vranken, and P. Baldi. Deep learning for chemical reaction prediction. Molecular Systems Design & Engineering, 2018.
[8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.
[9] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
[10] M. Hirohara, Y. Saito, Y. Koda, K. Sato, and Y. Sakakibara. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics, 19(19):526, 2018.
[11] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
[12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[13] M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik. SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry. arXiv preprint arXiv:1905.13741, 2019.
[14] S. Kwon and S. Yoon. DeepCCI: End-to-end deep learning for chemical-chemical interaction prediction. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 203–212. ACM, 2017.
[15] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014.
[16] L. Ruddigkeit, R. Van Deursen, L. C. Blum, and J.-L. Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012.
[17] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
[18] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.