MatVAE: Independently Trained Nested Variational Autoencoder for Generating Chemical Structural Formula

Yoshihiro Osakabe,1 Akinori Asahara1
1 Hitachi, Ltd. Research and Development Group
yoshihiro.osakabe.fj@hitachi.com, akinori.asahara.bq@hitachi.com

Abstract

Rapid materials development utilizes deep generative models to suggest candidate compounds with desirable properties before actual experiments. Such models successfully generate novel candidates with improved properties in some cases, but they usually require a large experimental dataset, which is difficult to obtain. We propose MatVAE, two nested VAEs independently trained on different datasets. The first VAE, which is trained on a huge open dataset, is a universal generator of chemical structural formulae, and the second VAE, which is trained on a small experimental dataset, learns the structure–property relation. This training framework can be understood as semi-supervised learning, which is expected to enhance model transferability. We verified that MatVAE generates five times more valid candidate compounds than the conventional un-nested single VAE model.

Introduction

Determining the optimal combination of ingredients and parameters can be a costly and time-consuming process in product development. Materials informatics (MI) is an emerging field that integrates informatics and materials science with the goal of greatly reducing the resources and risks involved in discovering, investing in, and deploying new materials (Curtarolo et al. 2013). Recently, artificial intelligence (AI) has led to improvements in MI; experimental candidates can be narrowed down without unnecessary trial and error before actual experiments to discover or create new materials with desirable property values.

Statistically modeling the relationship between the descriptors and the property values of a compound is a widely used method for predicting the property values of candidate compounds without conducting experiments. In drug discovery, this relationship is referred to as the quantitative structure–activity relationship (QSAR). In general, such a methodology is based on two processes: calculating descriptors from the chemical structure by extracting structural features, and building a model that relates the descriptors to the property values. The constructed statistical model can then be used to predict property values from chemical structures. Recently, machine learning techniques have been widely used to build such statistical models.
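As a concrete illustration of this forward workflow, the following minimal sketch computes Morgan fingerprint descriptors with RDKit and fits an off-the-shelf regressor. The toy dataset and the choice of a random forest are illustrative assumptions, not part of this work.

```python
# Conventional QSAR sketch: descriptors from structure, then a regression
# model from descriptors to a property value.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def descriptor(smiles: str) -> np.ndarray:
    """Encode a molecule as a 2048-bit Morgan fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return np.array(fp)

# Toy training data: (SMILES, measured property value) pairs.
train = [("c1ccccc1", 1.69), ("CCO", -0.31), ("CC(=O)O", -0.17)]
X = np.stack([descriptor(s) for s, _ in train])
y = np.array([v for _, v in train])

model = RandomForestRegressor(n_estimators=100).fit(X, y)
print(model.predict(descriptor("CCN").reshape(1, -1)))  # forward prediction
```

Note that such a model only maps structure to property; obtaining a structure for a target property value is the inverse problem discussed next.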
However, there are two problems when solving the inverse problem, i.e., obtaining a chemical structure with desired properties. The first problem is that a statistical model is generally a complex nonlinear function, so an inverse function that yields a descriptor from a property value cannot be obtained explicitly. The second problem is that even if a descriptor is obtained, it is difficult to construct a chemical structure containing that descriptor.

In recent years, an approach that has been extensively studied is to use deep generative models to directly obtain chemical structures that have desirable properties without calculating descriptors. Previously, methods for training a generative adversarial network (GAN) within the reinforcement learning (RL) framework have been reported (Olivecrona et al. 2017; De Cao and Kipf 2018). The major difference between them is the representation of the compounds; REINVENT (Olivecrona et al. 2017) and MolGAN (De Cao and Kipf 2018) use text-based and graph-based representations, respectively. In RL, atoms are attached to the main chain in order to attain more optimal property values. However, such a sequential method of generation is considered challenging to apply because the property values of a chemical substance can change significantly with or without local substructures, such as functional groups.

Another approach uses variational autoencoder (VAE) models, for example, JT-VAE with a graph-based representation (Jin, Barzilay, and Jaakkola 2018) and ChemicalVAE with a text-based representation (Gómez-Bombarelli et al. 2018). With VAEs, a chemical structure can be obtained directly by specifying one point in the latent space. Studies have shown that it is possible to optimize property values by searching the latent space because of the continuity of the space. However, it is important to note that these previous studies required huge training datasets. ChemicalVAE and REINVENT were trained using supervised learning with 250,000 and 350,000 compounds extracted from the ZINC database, respectively. In most cases, when developing industrial chemical products, only a few hundred or thousand supervised training data, i.e., data with property values measured by experiments, are available for a single product family.

The purpose of this study is to propose a method for suggesting viable candidate compounds with improved properties even with a small amount of experimental data. In addition, great attention should be paid to the transferability of the trained model because training deep generative models is costly in general. Hitachi, Ltd. provides MI solutions, such as a cloud-based IT platform for non-experts (Osakabe, Asahara, and Morita 2020), to several material manufacturers. The proposed method is based on semi-supervised learning so that the trained model can be reused in different projects, which is advantageous to such business.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Diagram of the proposed nested variational autoencoders used for molecular generation.

Related Works

Variational Autoencoder for Molecular Generation

Deep neural networks (DNNs) can be used to represent highly nonlinear relationships, and they have outperformed other methods on regression and classification tasks in various fields. Recently, DNNs have been utilized to generate novel samples similar to the training data, i.e., as deep generative models. A deep generative model assumes that the observed data x is generated from an unobserved latent variable z and aims to learn a transformation rule p(x|z). One such deep generative model is the variational autoencoder (VAE), which contains two neural networks, an encoder network and a decoder network. The VAE assumes the continuous latent variable z to follow a multidimensional Gaussian distribution. The encoder is trained to convert a data point x into a latent variable z, and the decoder is simultaneously trained to convert z back into x.
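For reference, the following is a minimal, generic VAE sketch in PyTorch showing the encoder/decoder pair, the Gaussian latent variable, and the reparameterized sampling. The layer widths are placeholders; this is not the MatVAE architecture, which is specified later.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim: int, z_dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(256, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = nn.functional.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```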
Gómez-Bombarelli et al. integrated this VAE framework into automatic molecular design, namely ChemicalVAE (Gómez-Bombarelli et al. 2018). Starting from a discrete molecular representation such as a SMILES string (Weininger 1988), the ChemicalVAE encoder converts this discrete representation x of a molecule into a real-valued continuous vector z, and its decoder converts z back into the discrete molecular representation x. In ChemicalVAE, the encoder utilizes a one-dimensional convolutional neural network (CNN) to convolve strings, and the decoder utilizes a recurrent neural network (RNN) to generate SMILES strings. By selecting a point in the latent space and passing it through the decoder, the corresponding SMILES can be obtained directly.

Two approaches can be used to generate a candidate molecule with the desired property values. One is to use the trained VAE only to generate the novel structure and then predict its property values by calculating descriptors within the conventional QSAR framework. The other approach is to build a new regression model on the latent variable and the property values. ChemicalVAE follows the second approach. To ensure that the VAE generates effective candidate molecules with improved property values, ChemicalVAE has an additional network, the predictor network, which estimates property values from the latent continuous vector z. Because of the continuity of the latent space, ChemicalVAE can automatically generate novel chemical structures by simple operations in the latent space, such as decoding random vectors, sampling the neighbors of known chemical structures, or interpolating between molecules. However, the continuity of the latent space is only guaranteed around the points of the known molecules, and the effectiveness of searching for novel candidates expected to have more optimal property values depends on the performance of the predictor network. In general, such a predictor requires a large amount of training data including both SMILES and the corresponding property values. In most cases, it is difficult to prepare such a large experimental dataset.

Semi-supervised Learning for VAE

A conditional VAE (CVAE) has been developed as an extension of the VAE that takes the objective variable y into account in the formulation of the latent variable z. Kingma et al. proposed a deep generative model called the M1+M2 model, based on CVAE architectures. The M1 model is trained with a large amount of unlabeled data, which lacks information on the objective variable y, and the M2 model is then trained on the latent variable z by mixing a few labeled data with a large amount of unlabeled data. This model achieved 96% correctness in a handwriting recognition task on MNIST where only 100 images out of 70,000 were given labels (correct answers). This approach can be understood as semi-supervised learning and is expected to be effective in generating molecules when limited to a small amount of experimental data. However, both the M1 and M2 models require a large amount of unlabeled data for training. When the experimental data is replaced, the M2 model has to be trained again, which is not desirable in terms of computation time and reuse of the trained model.
Proposed Method

In this paper, we propose a generative model consisting of two nested VAEs, introducing semi-supervised learning similar to the M1+M2 model. To allow reuse of the trained model, the two VAEs are trained independently on different datasets. Figure 1 shows a diagram of the proposed model, MatVAE. The first VAE (outer VAE) is trained to learn structural characteristics using a huge compound structure dataset without property values, such as an open dataset. The second VAE (inner VAE) is trained to learn the relation between the structural characteristics and the property values to be improved, using an experimental dataset that includes both structure information and its property values. Unlike the M1+M2 model, the costly training on the huge dataset only needs to be done once for the outer VAE, and there is no need to retrain the outer VAE if the experimental data changes. This model is expected to generate candidate compounds with improved property values when given a vector similar to those of the existing top-level compounds paired with the desired property values.

Figure 2: Diagram of learning steps; (a) steps 1 and 2 for training the outer VAE, and (b) step 3 for training the inner VAE.

Molecular Representations

SMILES (Weininger 1988) is a common format for representing molecules as character sequences. With advances in natural language processing, this text-based format is widely used in MI applications such as predicting (Schwaller et al. 2018, 2019a; Coley et al. 2019; Bradshaw et al. 2019) and classifying (Schwaller et al. 2019b) chemical reactions. To make SMILES suitable as input to VAE architectures, the SMILES strings are encoded into one-hot vectors forming an M × N matrix, where M is the number of SMILES symbols and N is the maximum length of the SMILES strings. In our experiment, M is 101 and N is 90. The one-hot vectors indicate the presence or absence of each symbol at each position within a sequence, as illustrated in Fig. 3. InChI (Heller et al. 2013) is another common representation of molecules, but Gómez-Bombarelli et al. reported that it is less effective than SMILES with VAE architectures (Gómez-Bombarelli et al. 2018). Note that SELFIES (Krenn et al. 2020) is another text-based representation, based on a combination of string and graph expressions of molecules. SELFIES was found to be highly robust against mutations in sequences and outperformed other representations (including SMILES strings) in terms of diversity, validity, and reconstruction accuracy when applied to sequence-based VAEs. SELFIES is compatible with SMILES and can be handled with one-hot encoding in the same way. We have confirmed that SELFIES improves the validity of molecular strings generated with MatVAE, but in this paper SMILES is used as the input format to clarify whether the proposed nested VAEs are more effective than the conventional single VAE model (Gómez-Bombarelli et al. 2018) for generating candidate compounds.

Figure 3: Example of SMILES representation and one-hot vectors for benzene. For simplicity, only two characters are shown in the one-hot encoding. In practice, the one-hot vectors form an M × N matrix, where M is the number of SMILES symbols and N is the maximum length of SMILES strings.
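A short sketch of this one-hot encoding follows. The reduced character set is an illustrative stand-in, since the full 101-symbol table is not listed in the paper.

```python
# SMILES one-hot encoding: each string becomes an M x N matrix
# (M symbols x N positions), padded to the maximum length N.
import numpy as np

CHARSET = ["<pad>", "c", "C", "N", "O", "1", "2", "(", ")", "=", "#"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(CHARSET)}
M, N = len(CHARSET), 90  # the paper uses M = 101 symbols and N = 90

def one_hot(smiles: str) -> np.ndarray:
    """Encode a SMILES string as an M x N one-hot matrix."""
    x = np.zeros((M, N), dtype=np.float32)
    for pos, ch in enumerate(smiles[:N]):
        x[CHAR_TO_IDX[ch], pos] = 1.0
    for pos in range(min(len(smiles), N), N):
        x[CHAR_TO_IDX["<pad>"], pos] = 1.0  # pad the remaining positions
    return x

print(one_hot("c1ccccc1").shape)  # benzene -> (11, 90); (101, 90) in the paper
```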
Outer VAE

The purpose of the outer VAE is to provide a general generative model that creates one-hot vectors that can be transformed into corresponding valid SMILES strings. Since the outer VAE does not need property value information, the SMILES training dataset can be easily curated from any open dataset such as ZINC (Irwin et al. 2012). All of the SMILES are converted into one-hot vectors before being input to the outer VAE.

The structure of the outer VAE network is based on ChemicalVAE (Gómez-Bombarelli et al. 2018) as follows: the outer encoder utilizes three one-dimensional convolutional layers to convolve the strings, where the filter sizes are 9, 9, 10 and the convolution kernels are 9, 9, 11, followed by one fully connected layer of dimension L1 = 196, where L1 is the size of the latent space of the outer VAE. The decoder utilizes three layers of gated recurrent unit (GRU) networks (Chung et al. 2014) with a hidden dimension of 488. The final layer of the decoder outputs a probability distribution over all possible symbols at each position in a SMILES string. As a result, the same point in latent space can be decoded into different SMILES strings, and a generated string may be invalid. Similarly to Gómez-Bombarelli et al. (2018), the GRU layers are given the ground-truth output as an additional input during training, i.e., teacher forcing (Williams and Zipser 1989), to improve performance.

Inner VAE

The purpose of the inner VAE is to learn the structure–property relations. By using the latent vector of the trained outer VAE instead of one-hot vectors to express the molecular structure, the inner VAE should be able to extract such relations more easily because the input dimension is reduced. The input of the inner VAE is the concatenation of a latent vector of the outer VAE and a vector of property values. Therefore, the input size is L1 + P, where P is the number of target properties to be improved or optimized, as shown in Fig. 2 (b).

The structure of the inner VAE network is as follows: both the inner encoder and decoder comprise multiple fully connected layers, with a latent space of width L2. The width of each intermediate layer is determined automatically from the ratio between L1 + P and L2, so that the widths shrink gradually from the input size to the latent size. In this report, the latent space dimension L2 is set to 128, and the inner encoder and decoder are both defined to have three layers.
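The following PyTorch sketch instantiates the two networks with the hyperparameters stated above (M = 101, N = 90, L1 = 196, GRU hidden size 488, L2 = 128). Activation functions, the intermediate layer widths of the inner VAE, and the decoder input scheme (teacher forcing is omitted) are assumptions where the text is not explicit; training follows the standard VAE objective from the earlier sketch.

```python
import torch
import torch.nn as nn

M, N = 101, 90          # SMILES symbols x maximum length (one-hot input)
L1, L2, P = 196, 128, 1 # outer latent, inner latent, number of properties

class OuterVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Three 1-D convolutions: 9, 9, and 10 filters with kernels 9, 9, 11.
        self.conv = nn.Sequential(
            nn.Conv1d(M, 9, kernel_size=9), nn.ReLU(),
            nn.Conv1d(9, 9, kernel_size=9), nn.ReLU(),
            nn.Conv1d(9, 10, kernel_size=11), nn.ReLU(), nn.Flatten())
        flat = 10 * (N - 8 - 8 - 10)  # sequence length left after the convs
        self.mu, self.logvar = nn.Linear(flat, L1), nn.Linear(flat, L1)
        self.z_to_h = nn.Linear(L1, 488)
        self.gru = nn.GRU(488, 488, num_layers=3, batch_first=True)
        self.out = nn.Linear(488, M)  # symbol distribution per position

    def encode(self, x):              # x: (batch, M, N) one-hot
        h = self.conv(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):              # feed z at every step (no teacher forcing)
        h = self.z_to_h(z).unsqueeze(1).repeat(1, N, 1)
        y, _ = self.gru(h)
        return self.out(y)            # (batch, N, M) logits

class InnerVAE(nn.Module):
    def __init__(self):
        super().__init__()
        d = L1 + P  # concatenated [outer latent; property values]
        # Widths step roughly evenly from d = 197 down to L2 = 128
        # (our reading of the paper's sizing rule).
        self.enc = nn.Sequential(nn.Linear(d, 174), nn.ReLU(),
                                 nn.Linear(174, 151), nn.ReLU())
        self.mu, self.logvar = nn.Linear(151, L2), nn.Linear(151, L2)
        self.dec = nn.Sequential(nn.Linear(L2, 151), nn.ReLU(),
                                 nn.Linear(151, 174), nn.ReLU(),
                                 nn.Linear(174, d))
```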
Learning and Generating Procedures

The inner and outer VAEs are trained independently. Again, the required datasets for training the two VAEs are (1) the structure dataset, containing only SMILES and curated from open datasets, and (2) the experimental dataset, containing both SMILES and experimentally measured property values. Figure 2 shows the three steps that make up the learning procedure:

Step 1 Train the outer VAE with the structure dataset.
Step 2 Train the outer VAE with the SMILES in the experimental dataset.
Step 3 Train the inner VAE with the experimental dataset.

At first glance, Step 2 appears to be redundant. However, it is necessary to ensure the continuity of the latent space even around the compounds in the experimental dataset. Unlike the M1+M2 model, the inner VAE is trained only on the labeled data of the experimental dataset. Hence, if the outer VAE does not learn those compounds, the two VAEs cannot work together in the generation phase. In fact, generating candidate compounds with the nested VAEs without Step 2 resulted in almost no valid SMILES.

Figure 4: Diagram of the generation phase for MatVAE.

Figure 4 illustrates the generation phase. To generate candidate compounds expected to have improved properties, we use a vector of width L1 + P consisting of a latent vector similar to those of the existing top-level compounds paired with the desired property values. Candidate compounds can then be generated more efficiently by comprehensively combining multiple property values close to the target value with a single latent vector.
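Under our reading of Fig. 4, the generation phase can be sketched as follows, reusing the OuterVAE and InnerVAE classes from the previous sketch; decode_to_smiles, which would convert the per-position symbol distribution into a string, is a hypothetical helper.

```python
import torch

@torch.no_grad()
def generate(outer, inner, x_top, y_desired):
    """x_top: (1, M, N) one-hot of a top-level compound; y_desired: (1, P)."""
    mu1, _ = outer.encode(x_top)            # latent vector z1 of the compound
    v = torch.cat([mu1, y_desired], dim=1)  # width L1 + P
    h = inner.enc(v)
    z2 = inner.mu(h)                        # inner latent (or sample around it)
    v_rec = inner.dec(z2)                   # reconstructed [z1'; y']
    z1_new = v_rec[:, :mu1.shape[1]]        # keep the structural part
    logits = outer.decode(z1_new)           # per-position symbol distribution
    return logits                           # -> decode_to_smiles(logits)

# Sweeping several desired property values near the target for one compound:
# for y in torch.linspace(1.90, 2.00, steps=11):
#     generate(outer, inner, x_top, y.view(1, 1))
```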
Experiment

Setup

Candidate compounds generated by MatVAE are not guaranteed to be synthesizable even if they are valid according to the SMILES grammar. In this experiment, the synthetic accessibility (SA) score (Ertl and Schuffenhauer 2009) is used instead of actual chemical properties so that the true value for an unknown SMILES can be calculated without conducting actual experiments. The SA score is an index of ease of fabrication and is uniquely determined for any valid SMILES.

One of the objectives of MatVAE is to generate candidate compounds that have improved property values. In other words, the model is expected to generate SMILES with property values outside the range of the given experimental dataset. To evaluate MatVAE, we deliberately limited the range of SA scores included in the experimental training dataset, as shown in Table 1. The experimental dataset is limited to compounds for which si is larger than 2.05, where si denotes the SA score of the i-th compound in the dataset. This means that the inner VAE cannot learn the structure–property relation in the range below 2.05. One hundred compounds with 2.05 ≥ si ≥ 2.00 were curated as a top-level compound dataset, used as the VAE input in the generation phase. Note that all of the compounds in this experiment were curated from the ZINC database.

Role          Size   Components         SA Score Range
Structure     500k   SMILES             –
Experimental  1k     SMILES, SA score   si ≥ 2.05
Top-level     0.1k   SMILES             2.05 ≥ si ≥ 2.00

Table 1: Overview of the training datasets in the experiment.

For the evaluation, the proposed model generated candidate compounds repeatedly until 100 valid SMILES were obtained, and the number of generated compounds with SA scores below 2.0 was counted. If MatVAE is able to generate such compounds, it confirms that the model successfully suggests unknown compounds that exceed the existing top-level compounds.
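A sketch of this evaluation loop: validity is checked with RDKit, and the SA score is computed with the sascorer module shipped in RDKit's Contrib/SA_Score directory (adding it to sys.path is an environment-specific assumption).

```python
import os
import sys
from rdkit import Chem
from rdkit.Chem import RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # provides calculateScore(mol)

def count_optimal(candidates, threshold=2.0):
    """Count valid candidates whose SA score is below the threshold."""
    n = 0
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)  # None if the SMILES is invalid
        if mol is not None and sascorer.calculateScore(mol) < threshold:
            n += 1
    return n

print(count_optimal(["c1ccccc1", "CCO", "not-a-smiles"]))
```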
Results

Figure 5: The number of optimal and valid candidates generated by the conventional ChemicalVAE and the proposed MatVAE.

Figure 5 shows the number of valid, optimal compounds among the 100 suggested valid candidates; the red line indicates the performance of MatVAE. Even when the experimental dataset is small, MatVAE successfully generates valid SMILES with an improved SA score, though the ratio is not high. The blue line indicates the performance of the conventional method, ChemicalVAE, which is considerably lower than in the original results, even though all parameters and hyperparameters are the same as those of the original model. This is because ChemicalVAE is not intended to be trained on such a small amount of experimental data. In fact, when the number of experimental data points is increased to at least 5,000, ChemicalVAE outperforms MatVAE. Thus, for a small amount of experimental data, MatVAE can produce more than five times as many optimal compounds as ChemicalVAE.

Conclusion

We have proposed a new deep generative model, MatVAE, for producing molecules with optimal property values. The model consists of two variational autoencoders that are trained independently on different datasets. The first (outer) VAE aims to capture structural characteristics according to the grammar of a text-based representation of molecules such as SMILES. Even if the type of target properties changes, this VAE does not need to be replaced, making it transferable to other applications. The second (inner) VAE can learn structure–property relations directly from the experimental dataset because the first VAE extracts structural features of molecules by reducing the dimensions of the one-hot vectors converted from SMILES. Therefore, MatVAE can generate new molecule representations from the structure data of existing top-level compounds paired with the desired property values. Compared with the common optimization methodology for VAEs, which utilizes gradient-based search in the latent space, the proposed search method is more straightforward and user-friendly.

There are a number of possible improvements that can be made to MatVAE. In this study, we used a text-based molecular encoding and a GRU decoder. Such an architecture may make it unnecessarily difficult to produce valid SMILES strings. It is possible to use a graph-based representation instead of a text-based one, or to change the sequence-based VAE to a graph-based one. However, even without changing the network configuration, converting SMILES to one-hot vectors via the SELFIES representation could prevent the generation of invalid SMILES. We have already verified the effectiveness of this approach in other experiments.

Generated compounds are likely to have similar substructures, such as carbon chains, because VAEs tend to reproduce frequent patterns in the training dataset. In fact, MatVAE is more likely to produce compounds with a simple structure and a low SA score than compounds with a complex structure and a high SA score. To generate useful and novel candidate compounds while maintaining structural diversity, it will be necessary to modify the loss function in the future to include such restrictions.

References

Bradshaw, J.; Paige, B.; Kusner, M. J.; Segler, M. H. S.; and Hernández-Lobato, J. M. 2019. A Model to Search for Synthesizable Molecules. In Advances in Neural Information Processing Systems.

Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. URL http://arxiv.org/abs/1412.3555.

Coley, C. W.; Jin, W.; Rogers, L.; Jamison, T. F.; Jaakkola, T. S.; Green, W. H.; Barzilay, R.; and Jensen, K. F. 2019. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science 10(2): 370–377. doi:10.1039/c8sc04228d.

Curtarolo, S.; Hart, G. L.; Nardelli, M. B.; Mingo, N.; Sanvito, S.; and Levy, O. 2013. The high-throughput highway to computational materials design. Nature Materials 12(3): 191.

De Cao, N.; and Kipf, T. 2018. MolGAN: An implicit generative model for small molecular graphs. URL http://arxiv.org/abs/1805.11973.

Ertl, P.; and Schuffenhauer, A. 2009. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics 1(1): 8. doi:10.1186/1758-2946-1-8.

Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4(2): 268–276. doi:10.1021/acscentsci.7b00572.

Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; and Pletnev, I. 2013. InChI - the worldwide chemical structure identifier standard. Journal of Cheminformatics 5: 7. doi:10.1186/1758-2946-5-7.

Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; and Coleman, R. G. 2012. ZINC: A Free Tool to Discover Chemistry for Biology. Journal of Chemical Information and Modeling 52(7): 1757–1768. doi:10.1021/ci3001277.

Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. In 35th International Conference on Machine Learning, ICML 2018, 3632–3648.

Krenn, M.; Häse, F.; Nigam, A.; Friederich, P.; and Aspuru-Guzik, A. 2020. Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology. doi:10.1088/2632-2153/aba947.

Olivecrona, M.; Blaschke, T.; Engkvist, O.; and Chen, H. 2017. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics 9(1). doi:10.1186/s13321-017-0235-x.

Osakabe, Y.; Asahara, A.; and Morita, H. 2020. Hitachi Materials Informatics Analytics Platform Assisting Rapid Development. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.

Schwaller, P.; Gaudin, T.; Lányi, D.; Bekas, C.; and Laino, T. 2018. "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science 9(28): 6091–6098. doi:10.1039/c8sc02339e.

Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C. A.; Bekas, C.; and Lee, A. A. 2019a. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science 5(9): 1572–1583. doi:10.1021/acscentsci.9b00576.

Schwaller, P.; Probst, D.; Vaucher, A. C.; Nair, V. H.; Laino, T.; and Reymond, J.-L. 2019b. Data-Driven Chemical Reaction Classification, Fingerprinting and Clustering using Attention-Based Neural Networks. doi:10.7892/boris.141739.

Weininger, D. 1988. SMILES, a Chemical Language and Information System: 1: Introduction to Methodology and Encoding Rules. Journal of Chemical Information and Computer Sciences 28(1): 31–36. doi:10.1021/ci00057a005.

Williams, R. J.; and Zipser, D. 1989. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1(2): 270–280. doi:10.1162/neco.1989.1.2.270.