MatVAE: Independently Trained Nested Variational Autoencoder for Generating Chemical Structural Formula

Yoshihiro Osakabe,1 Akinori Asahara1
1 Hitachi, Ltd. Research and Development Group
yoshihiro.osakabe.fj@hitachi.com, akinori.asahara.bq@hitachi.com

Abstract

Rapid materials development utilizes deep generative models to suggest candidate compounds with desirable properties before actual experiments. Such models successfully generate novel candidates with improved properties in some cases, but they usually require a large experimental dataset, which is difficult to obtain. We propose MatVAE, two nested VAEs independently trained on different datasets. The first VAE, which is trained on a huge open dataset, is a universal generator of chemical structural formulae, and the second VAE, which is trained on a small experimental dataset, learns the structure–property relation. This training framework can be understood as semi-supervised learning, which is expected to enhance model transferability. We verified that MatVAE generates five times more valid candidate compounds than the conventional un-nested single VAE model.

Introduction

Determining the optimal combination of ingredients and parameters can be a costly and time-consuming process in product development. Materials informatics (MI) is an emerging field that integrates informatics and materials science with the goal of greatly reducing the resources and risks involved in discovering, investing in, and deploying new materials (Curtarolo et al. 2013). Recently, artificial intelligence (AI) has led to improvements in MI; experimental candidates can be narrowed down without unnecessary trial and error before actual experiments to discover or create new materials with desirable property values.

Statistically modeling the relationship between the descriptors and the property values of a compound is a widely used method for predicting the property values of candidate compounds without conducting experiments. In drug discovery, this relationship is referred to as the quantitative structure–activity relationship (QSAR). In general, such a methodology is based on two processes: calculating descriptors from the chemical structure by extracting structural features, and building a model that relates the descriptors to the property values. The constructed statistical model can then be used to predict property values from chemical structures. Recently, machine learning techniques have been widely used to build such statistical models.
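As a concrete illustration of this forward workflow, the following minimal sketch computes Morgan fingerprint descriptors with RDKit and fits an off-the-shelf regressor. The toy dataset and the choice of a random forest are illustrative assumptions, not part of this work.

```python
# Conventional QSAR sketch: descriptors from structure, then a regression
# model from descriptors to a property value.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def descriptor(smiles: str) -> np.ndarray:
    """Encode a molecule as a 2048-bit Morgan fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return np.array(fp)

# Toy training data: (SMILES, measured property value) pairs.
train = [("c1ccccc1", 1.69), ("CCO", -0.31), ("CC(=O)O", -0.17)]
X = np.stack([descriptor(s) for s, _ in train])
y = np.array([v for _, v in train])

model = RandomForestRegressor(n_estimators=100).fit(X, y)
print(model.predict(descriptor("CCN").reshape(1, -1)))  # forward prediction
```

Note that such a model only maps structure to property; obtaining a structure for a target property value is the inverse problem discussed next.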
However, there are two problems when solving the inverse problem, i.e., obtaining a chemical structure with desired properties. The first problem is that a statistical model is generally a complex nonlinear function, so an inverse function that yields a descriptor from a property value cannot be obtained explicitly. The second problem is that even if a descriptor is obtained, it is difficult to construct a chemical structure containing that descriptor.

In recent years, an approach that has been extensively studied is to use deep generative models to directly obtain chemical structures that have desirable properties without calculating descriptors. Previously, methods for training a generative adversarial network (GAN) within the reinforcement learning (RL) framework have been reported (Olivecrona et al. 2017; De Cao and Kipf 2018). The major difference between them is the representation of the compounds; REINVENT (Olivecrona et al. 2017) and MolGAN (De Cao and Kipf 2018) use text-based and graph-based representations, respectively. In RL, atoms are attached to the main chain in order to attain more optimal property values. However, such a sequential method of generation is considered challenging to apply because the property values of a chemical substance can change significantly with or without local substructures, such as functional groups.

Another approach uses variational autoencoder (VAE) models, for example, JT-VAE with a graph-based representation (Jin, Barzilay, and Jaakkola 2018) and ChemicalVAE with a text-based representation (Gómez-Bombarelli et al. 2018). With VAEs, a chemical structure can be obtained directly by specifying one point in the latent space. Studies have shown that it is possible to optimize property values by searching the latent space because of the continuity of the space. However, it is important to note that these previous studies required huge training datasets. ChemicalVAE and REINVENT were trained using supervised learning with 250,000 and 350,000 compounds extracted from the ZINC database, respectively. In most cases, when developing industrial chemical products, only a few hundred or thousand supervised training data, i.e., data with property values measured by experiments, are available for a single product family.

The purpose of this study is to propose a method for suggesting viable candidate compounds with improved properties even with a small amount of experimental data. In addition, great attention should be paid to the transferability of the trained model because training deep generative models is costly in general. Hitachi, Ltd. provides MI solutions, such as a cloud-based IT platform for non-experts (Osakabe, Asahara, and Morita 2020), to several material manufacturers. The proposed method is based on semi-supervised learning so that the trained model can be reused in different projects, which is advantageous to such business.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Diagram of the proposed nested variational autoencoders used for molecular generation.

Related Works

Variational Autoencoder for Molecular Generation

Deep neural networks (DNNs) can be used to represent highly nonlinear relationships, and they have outperformed other methods on regression and classification tasks in various fields. Recently, DNNs have been utilized to generate novel samples similar to the training data, i.e., as deep generative models. A deep generative model assumes that the observed data x is generated from an unobserved latent variable z and aims to learn a transformation rule p(x|z). One such deep generative model is the variational autoencoder (VAE), which contains two neural networks, an encoder network and a decoder network. The VAE assumes the continuous latent variable z to follow a multidimensional Gaussian distribution. The encoder is trained to convert a data point x into a latent variable z, and the decoder is simultaneously trained to convert z back into x.
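For reference, the following is a minimal, generic VAE sketch in PyTorch showing the encoder/decoder pair, the Gaussian latent variable, and the reparameterized sampling. The layer widths are placeholders; this is not the MatVAE architecture, which is specified later.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim: int, z_dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(256, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = nn.functional.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```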
Gómez-Bombarelli et al. integrated this VAE framework into automatic molecular design, namely ChemicalVAE (Gómez-Bombarelli et al. 2018). Starting from a discrete molecular representation such as a SMILES string (Weininger 1988), the ChemicalVAE encoder converts this discrete representation x of a molecule into a real-valued continuous vector z, and its decoder converts z back into the discrete molecular representation x. In ChemicalVAE, the encoder utilizes a one-dimensional convolutional neural network (CNN) to convolve strings, and the decoder utilizes a recurrent neural network (RNN) to generate SMILES strings. By selecting a point in the latent space and passing it through the decoder, the corresponding SMILES can be obtained directly.

Two approaches can be used to generate a candidate molecule with the desired property values. One is to use the trained VAE only to generate the novel structure and then predict its property values by calculating descriptors within the conventional QSAR framework. The other approach is to build a new regression model on the latent variable and the property values. ChemicalVAE follows the second approach. To ensure that the VAE generates effective candidate molecules with improved property values, ChemicalVAE has an additional network, the predictor network, which estimates property values from the latent continuous vector z. Because of the continuity of the latent space, ChemicalVAE can automatically generate novel chemical structures by simple operations in the latent space, such as decoding random vectors, sampling the neighbors of known chemical structures, or interpolating between molecules. However, the continuity of the latent space is only guaranteed around the points of the known molecules, and the effectiveness of searching for novel candidates expected to have more optimal property values depends on the performance of the predictor network. In general, such a predictor requires a large amount of training data including both SMILES and the corresponding property values. In most cases, it is difficult to prepare such a large experimental dataset.

Semi-supervised Learning for VAE

A conditional VAE (CVAE) has been developed as an extension of the VAE that takes the objective variable y into account in the formulation of the latent variable z. Kingma et al. proposed a deep generative model called the M1+M2 model, based on CVAE architectures. The M1 model is trained with a large amount of unlabeled data, which lacks information on the objective variable y, and the M2 model is then trained on the latent variable z by mixing a few labeled data with a large amount of unlabeled data. This model achieved 96% correctness in a handwriting recognition task on MNIST where only 100 images out of 70,000 were given labels (correct answers). This approach can be understood as semi-supervised learning and is expected to be effective in generating molecules when limited to a small amount of experimental data. However, both the M1 and M2 models require a large amount of unlabeled data for training. When the experimental data is replaced, the M2 model has to be trained again, which is not desirable in terms of computation time and reuse of the trained model.
Proposed Method

In this paper, we propose a generative model consisting of two nested VAEs, introducing semi-supervised learning similar to the M1+M2 model. To allow reuse of the trained model, the two VAEs are trained independently on different datasets. Figure 1 shows a diagram of the proposed model, MatVAE. The first VAE (outer VAE) is trained to learn structural characteristics using a huge compound structure dataset without property values, such as an open dataset. The second VAE (inner VAE) is trained to learn the relation between the structural characteristics and the property values to be improved, using an experimental dataset that includes both structure information and its property values. Unlike the M1+M2 model, the costly training on the huge dataset only needs to be done once for the outer VAE, and there is no need to retrain the outer VAE if the experimental data changes. This model is expected to generate candidate compounds with improved property values when given a vector similar to those of the existing top-level compounds paired with the desired property values.

Figure 2: Diagram of learning steps; (a) steps 1 and 2 for training the outer VAE, and (b) step 3 for training the inner VAE.

Molecular Representations

SMILES (Weininger 1988) is a common format for representing molecules as character sequences. With advances in natural language processing, this text-based format is widely used in MI applications such as predicting (Schwaller et al. 2018, 2019a; Coley et al. 2019; Bradshaw et al. 2019) and classifying (Schwaller et al. 2019b) chemical reactions. To make SMILES suitable as input to VAE architectures, the SMILES strings are encoded into one-hot vectors forming an M × N matrix, where M is the number of SMILES symbols and N is the maximum length of the SMILES strings. In our experiment, M is 101 and N is 90. The one-hot vectors indicate the presence or absence of each symbol at each position within a sequence, as illustrated in Fig. 3. InChI (Heller et al. 2013) is another common representation of molecules, but Gómez-Bombarelli et al. reported that it is less effective than SMILES with VAE architectures (Gómez-Bombarelli et al. 2018). Note that SELFIES (Krenn et al. 2020) is another text-based representation, based on a combination of string and graph expressions of molecules. SELFIES was found to be highly robust against mutations in sequences and outperformed other representations (including SMILES strings) in terms of diversity, validity, and reconstruction accuracy when applied to sequence-based VAEs. SELFIES is compatible with SMILES and can be handled with one-hot encoding in the same way. We have confirmed that SELFIES improves the validity of molecular strings generated with MatVAE, but in this paper SMILES is used as the input format to clarify whether the proposed nested VAEs are more effective than the conventional single VAE model (Gómez-Bombarelli et al. 2018) for generating candidate compounds.

Figure 3: Example of SMILES representation and one-hot vectors for benzene. For simplicity, only two characters are shown in the one-hot encoding. In practice, the one-hot vectors form an M × N matrix, where M is the number of SMILES symbols and N is the maximum length of SMILES strings.
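A short sketch of this one-hot encoding follows. The reduced character set is an illustrative stand-in, since the full 101-symbol table is not listed in the paper.

```python
# SMILES one-hot encoding: each string becomes an M x N matrix
# (M symbols x N positions), padded to the maximum length N.
import numpy as np

CHARSET = ["<pad>", "c", "C", "N", "O", "1", "2", "(", ")", "=", "#"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(CHARSET)}
M, N = len(CHARSET), 90  # the paper uses M = 101 symbols and N = 90

def one_hot(smiles: str) -> np.ndarray:
    """Encode a SMILES string as an M x N one-hot matrix."""
    x = np.zeros((M, N), dtype=np.float32)
    for pos, ch in enumerate(smiles[:N]):
        x[CHAR_TO_IDX[ch], pos] = 1.0
    for pos in range(min(len(smiles), N), N):
        x[CHAR_TO_IDX["<pad>"], pos] = 1.0  # pad the remaining positions
    return x

print(one_hot("c1ccccc1").shape)  # benzene -> (11, 90); (101, 90) in the paper
```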
Outer VAE

The purpose of the outer VAE is to provide a general generative model that creates one-hot vectors that can be transformed into corresponding valid SMILES strings. Since the outer VAE does not need property value information, the SMILES training dataset can be easily curated from any open dataset such as ZINC (Irwin et al. 2012). All of the SMILES are converted into one-hot vectors before being input to the outer VAE.

The structure of the outer VAE network is based on ChemicalVAE (Gómez-Bombarelli et al. 2018) as follows: the outer encoder utilizes three one-dimensional convolutional layers to convolve the strings, where the filter sizes are 9, 9, 10 and the convolution kernels are 9, 9, 11, followed by one fully connected layer of dimension L1 = 196, where L1 is the size of the latent space of the outer VAE. The decoder utilizes three layers of gated recurrent unit (GRU) networks (Chung et al. 2014) with a hidden dimension of 488. The final layer of the decoder outputs a probability distribution over all possible symbols at each position in a SMILES string. As a result, the same point in latent space can be decoded into different SMILES strings, and a generated string may be invalid. Similarly to Gómez-Bombarelli et al. (2018), the GRU layers are given the ground-truth output as an additional input during training, i.e., teacher forcing (Williams and Zipser 1989), to improve performance.

Inner VAE

The purpose of the inner VAE is to learn the structure–property relations. By using the latent vector of the trained outer VAE instead of one-hot vectors to express the molecular structure, the inner VAE should be able to extract such relations more easily because the input dimension is reduced. The input of the inner VAE is the concatenation of a latent vector of the outer VAE and a vector of property values. Therefore, the input size is L1 + P, where P is the number of target properties to be improved or optimized, as shown in Fig. 2 (b).

The structure of the inner VAE network is as follows: both the inner encoder and decoder comprise multiple fully connected layers, with a latent space of width L2. The width of each intermediate layer is determined automatically from the ratio between L1 + P and L2, so that the widths shrink gradually from the input size to the latent size. In this report, the latent space dimension L2 is set to 128, and the inner encoder and decoder are both defined to have three layers.
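The following PyTorch sketch instantiates the two networks with the hyperparameters stated above (M = 101, N = 90, L1 = 196, GRU hidden size 488, L2 = 128). Activation functions, the intermediate layer widths of the inner VAE, and the decoder input scheme (teacher forcing is omitted) are assumptions where the text is not explicit; training follows the standard VAE objective from the earlier sketch.

```python
import torch
import torch.nn as nn

M, N = 101, 90          # SMILES symbols x maximum length (one-hot input)
L1, L2, P = 196, 128, 1 # outer latent, inner latent, number of properties

class OuterVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Three 1-D convolutions: 9, 9, and 10 filters with kernels 9, 9, 11.
        self.conv = nn.Sequential(
            nn.Conv1d(M, 9, kernel_size=9), nn.ReLU(),
            nn.Conv1d(9, 9, kernel_size=9), nn.ReLU(),
            nn.Conv1d(9, 10, kernel_size=11), nn.ReLU(), nn.Flatten())
        flat = 10 * (N - 8 - 8 - 10)  # sequence length left after the convs
        self.mu, self.logvar = nn.Linear(flat, L1), nn.Linear(flat, L1)
        self.z_to_h = nn.Linear(L1, 488)
        self.gru = nn.GRU(488, 488, num_layers=3, batch_first=True)
        self.out = nn.Linear(488, M)  # symbol distribution per position

    def encode(self, x):              # x: (batch, M, N) one-hot
        h = self.conv(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):              # feed z at every step (no teacher forcing)
        h = self.z_to_h(z).unsqueeze(1).repeat(1, N, 1)
        y, _ = self.gru(h)
        return self.out(y)            # (batch, N, M) logits

class InnerVAE(nn.Module):
    def __init__(self):
        super().__init__()
        d = L1 + P  # concatenated [outer latent; property values]
        # Widths step roughly evenly from d = 197 down to L2 = 128
        # (our reading of the paper's sizing rule).
        self.enc = nn.Sequential(nn.Linear(d, 174), nn.ReLU(),
                                 nn.Linear(174, 151), nn.ReLU())
        self.mu, self.logvar = nn.Linear(151, L2), nn.Linear(151, L2)
        self.dec = nn.Sequential(nn.Linear(L2, 151), nn.ReLU(),
                                 nn.Linear(151, 174), nn.ReLU(),
                                 nn.Linear(174, d))
```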
Learning and Generating Procedures

The inner and outer VAEs are trained independently. Again, the required datasets for training the two VAEs are (1) the structure dataset, containing only SMILES and curated from open datasets, and (2) the experimental dataset, containing both SMILES and experimentally measured property values. Figure 2 shows the three steps that make up the learning procedure:

Step 1 Train the outer VAE with the structure dataset.
Step 2 Train the outer VAE with the SMILES in the experimental dataset.
Step 3 Train the inner VAE with the experimental dataset.

At first glance, Step 2 appears to be redundant. However, it is necessary to ensure the continuity of the latent space even around the compounds in the experimental dataset. Unlike the M1+M2 model, the inner VAE is trained only on the labeled data of the experimental dataset. Hence, if the outer VAE does not learn those compounds, the two VAEs cannot work together in the generation phase. In fact, generating candidate compounds with the nested VAEs without Step 2 resulted in almost no valid SMILES.

Figure 4: Diagram of the generation phase for MatVAE.

Figure 4 illustrates the generation phase. To generate candidate compounds expected to have improved properties, we use a vector of width L1 + P consisting of a latent vector similar to those of the existing top-level compounds paired with the desired property values. Candidate compounds can then be generated more efficiently by comprehensively combining multiple property values close to the target value with a single latent vector.
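Under our reading of Fig. 4, the generation phase can be sketched as follows, reusing the OuterVAE and InnerVAE classes from the previous sketch; decode_to_smiles, which would convert the per-position symbol distribution into a string, is a hypothetical helper.

```python
import torch

@torch.no_grad()
def generate(outer, inner, x_top, y_desired):
    """x_top: (1, M, N) one-hot of a top-level compound; y_desired: (1, P)."""
    mu1, _ = outer.encode(x_top)            # latent vector z1 of the compound
    v = torch.cat([mu1, y_desired], dim=1)  # width L1 + P
    h = inner.enc(v)
    z2 = inner.mu(h)                        # inner latent (or sample around it)
    v_rec = inner.dec(z2)                   # reconstructed [z1'; y']
    z1_new = v_rec[:, :mu1.shape[1]]        # keep the structural part
    logits = outer.decode(z1_new)           # per-position symbol distribution
    return logits                           # -> decode_to_smiles(logits)

# Sweeping several desired property values near the target for one compound:
# for y in torch.linspace(1.90, 2.00, steps=11):
#     generate(outer, inner, x_top, y.view(1, 1))
```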
Experiment

Setup

Candidate compounds generated by MatVAE are not guaranteed to be synthesizable even if they are valid according to the SMILES grammar. In this experiment, the synthetic accessibility (SA) score (Ertl and Schuffenhauer 2009) is used instead of actual chemical properties so that the true value for an unknown SMILES can be calculated without conducting actual experiments. The SA score is an index of ease of fabrication and is uniquely determined for any valid SMILES.

One of the objectives of MatVAE is to generate candidate compounds that have improved property values. In other words, the model is expected to generate SMILES with property values outside the range of the given experimental dataset. To evaluate MatVAE, we deliberately limited the range of SA scores included in the experimental training dataset, as shown in Table 1. The experimental dataset is limited to compounds for which si is larger than 2.05, where si denotes the SA score of the i-th compound in the dataset. This means that the inner VAE cannot learn the structure–property relation in the range below 2.05. One hundred compounds with 2.05 ≥ si ≥ 2.00 were curated as a top-level compound dataset, used as the VAE input in the generation phase. Note that all of the compounds in this experiment were curated from the ZINC database.

Role          Size   Components         SA Score Range
Structure     500k   SMILES             –
Experimental  1k     SMILES, SA score   si ≥ 2.05
Top-level     0.1k   SMILES             2.05 ≥ si ≥ 2.00

Table 1: Overview of the training datasets in the experiment.

For the evaluation, the proposed model generated candidate compounds repeatedly until 100 valid SMILES were obtained, and the number of generated compounds with SA scores below 2.0 was counted. If MatVAE is able to generate such compounds, it confirms that the model successfully suggests unknown compounds that exceed the existing top-level compounds.
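A sketch of this evaluation loop: validity is checked with RDKit, and the SA score is computed with the sascorer module shipped in RDKit's Contrib/SA_Score directory (adding it to sys.path is an environment-specific assumption).

```python
import os
import sys
from rdkit import Chem
from rdkit.Chem import RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # provides calculateScore(mol)

def count_optimal(candidates, threshold=2.0):
    """Count valid candidates whose SA score is below the threshold."""
    n = 0
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)  # None if the SMILES is invalid
        if mol is not None and sascorer.calculateScore(mol) < threshold:
            n += 1
    return n

print(count_optimal(["c1ccccc1", "CCO", "not-a-smiles"]))
```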
Results

Figure 5: The number of optimal and valid candidates generated by the conventional ChemicalVAE and the proposed MatVAE.

Figure 5 shows the number of valid, optimal compounds among the 100 suggested valid candidates; the red line indicates the performance of MatVAE. Even when the experimental dataset is small, MatVAE successfully generates valid SMILES with an improved SA score, though the ratio is not high. The blue line indicates the performance of the conventional method, ChemicalVAE, which is considerably lower than in the original results, even though all parameters and hyperparameters are the same as those of the original model. This is because ChemicalVAE is not intended to be trained on such a small amount of experimental data. In fact, when the number of experimental data points is increased to at least 5,000, ChemicalVAE outperforms MatVAE. Thus, for a small amount of experimental data, MatVAE can produce more than five times as many optimal compounds as ChemicalVAE.

Conclusion

We have proposed a new deep generative model, MatVAE, for producing molecules with optimal property values. The model consists of two variational autoencoders that are trained independently on different datasets. The first (outer) VAE aims to capture structural characteristics according to the grammar of a text-based representation of molecules such as SMILES. Even if the type of target properties changes, this VAE does not need to be replaced, making it transferable to other applications. The second (inner) VAE can learn structure–property relations directly from the experimental dataset because the first VAE extracts structural features of molecules by reducing the dimensions of the one-hot vectors converted from SMILES. Therefore, MatVAE can generate new molecule representations from the structure data of existing top-level compounds paired with the desired property values. Compared with the common optimization methodology for VAEs, which utilizes gradient-based search in the latent space, the proposed search method is more straightforward and user-friendly.

There are a number of possible improvements that can be made to MatVAE. In this study, we used a text-based molecular encoding and a GRU decoder. Such an architecture may make it unnecessarily difficult to produce valid SMILES strings. It is possible to use a graph-based representation instead of a text-based one, or to change the sequence-based VAE to a graph-based one. However, even without changing the network configuration, converting SMILES to one-hot vectors via the SELFIES representation could prevent the generation of invalid SMILES. We have already verified the effectiveness of this approach in other experiments.

Generated compounds are likely to have similar substructures, such as carbon chains, because VAEs tend to reproduce frequent patterns in the training dataset. In fact, MatVAE is more likely to produce compounds with a simple structure and a low SA score than compounds with a complex structure and a high SA score. To generate useful and novel candidate compounds while maintaining structural diversity, it will be necessary to modify the loss function in the future to include such restrictions.

References

Bradshaw, J.; Paige, B.; Kusner, M. J.; Segler, M. H. S.; and Hernández-Lobato, J. M. 2019. A Model to Search for Synthesizable Molecules. In Advances in Neural Information Processing Systems.

Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. URL http://arxiv.org/abs/1412.3555.

Coley, C. W.; Jin, W.; Rogers, L.; Jamison, T. F.; Jaakkola, T. S.; Green, W. H.; Barzilay, R.; and Jensen, K. F. 2019. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science 10(2): 370–377. doi:10.1039/c8sc04228d.

Curtarolo, S.; Hart, G. L.; Nardelli, M. B.; Mingo, N.; Sanvito, S.; and Levy, O. 2013. The high-throughput highway to computational materials design. Nature Materials 12(3): 191.

De Cao, N.; and Kipf, T. 2018. MolGAN: An implicit generative model for small molecular graphs. URL http://arxiv.org/abs/1805.11973.

Ertl, P.; and Schuffenhauer, A. 2009. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics 1(1): 8. doi:10.1186/1758-2946-1-8.

Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4(2): 268–276. doi:10.1021/acscentsci.7b00572.

Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; and Pletnev, I. 2013. InChI - the worldwide chemical structure identifier standard. Journal of Cheminformatics 5: 7. doi:10.1186/1758-2946-5-7.

Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; and Coleman, R. G. 2012. ZINC: A Free Tool to Discover Chemistry for Biology. Journal of Chemical Information and Modeling 52(7): 1757–1768. doi:10.1021/ci3001277.

Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. In 35th International Conference on Machine Learning, ICML 2018, 3632–3648.

Krenn, M.; Häse, F.; Nigam, A.; Friederich, P.; and Aspuru-Guzik, A. 2020. Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology. doi:10.1088/2632-2153/aba947.

Olivecrona, M.; Blaschke, T.; Engkvist, O.; and Chen, H. 2017. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics 9(1). doi:10.1186/s13321-017-0235-x.

Osakabe, Y.; Asahara, A.; and Morita, H. 2020. Hitachi Materials Informatics Analytics Platform Assisting Rapid Development. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.

Schwaller, P.; Gaudin, T.; Lányi, D.; Bekas, C.; and Laino, T. 2018. "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science 9(28): 6091–6098. doi:10.1039/c8sc02339e.

Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C. A.; Bekas, C.; and Lee, A. A. 2019a. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science 5(9): 1572–1583. doi:10.1021/acscentsci.9b00576.

Schwaller, P.; Probst, D.; Vaucher, A. C.; Nair, V. H.; Laino, T.; and Reymond, J.-L. 2019b. Data-Driven Chemical Reaction Classification, Fingerprinting and Clustering using Attention-Based Neural Networks. doi:10.7892/boris.141739.

Weininger, D. 1988. SMILES, a Chemical Language and Information System: 1: Introduction to Methodology and Encoding Rules. Journal of Chemical Information and Computer Sciences 28(1): 31–36. doi:10.1021/ci00057a005.

Williams, R. J.; and Zipser, D. 1989. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1(2): 270–280. doi:10.1162/neco.1989.1.2.270.