=Paper=
{{Paper
|id=Vol-2350/paper18
|storemode=property
|title=CrystalGAN: Learning to Discover Crystallographic Structures with Generative Adversarial Networks
|pdfUrl=https://ceur-ws.org/Vol-2350/paper18.pdf
|volume=Vol-2350
|authors=Asma Nouira,Nataliya Sokolovska,Jean-Claude Crivello
|dblpUrl=https://dblp.org/rec/conf/aaaiss/NouiraSC19
}}
==CrystalGAN: Learning to Discover Crystallographic Structures with Generative Adversarial Networks==
Asma Nouira (1), Nataliya Sokolovska (2), Jean-Claude Crivello (1)

(1) University Paris Est, ICMPE (UMR 7182), CNRS, UPEC, F-94320 Thiais, France

(2) Sorbonne University, INSERM, NutriOmics team, Paris, France
Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Our main motivation is to propose an efficient approach to generate novel multi-element stable chemical compounds that can be used in real-world applications. This task can be formulated as a combinatorial problem, and it takes human experts many hours to construct and to evaluate new data. Unsupervised learning methods such as Generative Adversarial Networks (GANs) can be efficiently used to produce new data. Cross-domain Generative Adversarial Networks were reported to achieve exciting results in image processing applications. However, in the domain of materials science, there is a need to synthesize data with higher-order complexity compared to observed samples, and the state-of-the-art cross-domain GANs cannot be adapted directly. In this contribution, we propose a novel GAN called CrystalGAN which generates new chemically stable crystallographic structures with increased domain complexity. We introduce an original architecture, we provide the corresponding loss functions, and we show that the CrystalGAN generates very reasonable data. We illustrate the efficiency of the proposed method on a real original problem of novel hydrides discovery that can be further used in development of hydrogen storage materials.

Keywords: Generative Adversarial Nets, Cross-Domain Learning, Materials Science, Higher-order Complexity.

Introduction

In modern society, a wide variety of inorganic compositions are used for hydrogen storage owing to their favorable cost (Crivello et al., 2016). A vast number of organic molecules are applied in devices such as solar cells, organic light-emitting diodes, conductors, and sensors (Yang et al., 2017). Synthesis of new organic and inorganic compounds is a challenge in physics, chemistry and materials science. Design of new structures aims to find the best solution in a large chemical space, and it is in fact a combinatorial optimization problem.

The number of applications of data mining methods in chemistry and materials science increases steadily (Seko, Togo, and Tanaka, 2017). There is a hope that recent developments in machine learning and data mining will accelerate the progress in materials science. Machine learning methods, namely generative models, are reported to be efficient in new data generation (Friedman, Tibshirani, and Hastie, 2009), and nowadays we have access both to techniques that generate a huge number of new chemical compounds and to techniques that test the properties of all these candidates.

In this work, we focus on applications of hydrogen storage, and in particular, we address the problem of investigating novel chemical compositions with stable crystals. Traditionally, density functional theory (DFT) plays a central role in the prediction of chemically relevant compositions with stable crystals (Seko et al., 2018). However, DFT calculations are computationally expensive, and it is not feasible to apply them to all possible randomly generated structures.

A number of machine learning approaches were proposed to facilitate the search for novel stable compositions (Butler et al., 2018). There was an attempt to find new compositions using an inorganic crystal structure database, and to estimate the probabilities of new candidates based on compositional similarities. These methods to generate relevant chemical compositions are based on recommender systems (Hu, Koren, and Volinsky, 2008). The output of the recommender systems applied in the crystallographic field is a rating or preference for a structure. A recent approach based on a combination of machine learning methods and high-throughput DFT calculations made it possible to explore ternary chemical compounds (Schmidt et al., 2018), and it was shown that statistical methods can be of great help in identifying stable structures, and that they do so much faster than standard methods. Recently, support vector machines were tested to predict crystal structures (Oliynyk et al., 2017), showing that the method can reliably predict the crystal structure given its composition. It is worth mentioning that the data representation of observations passed to a learner is critical, and that the data representations most suitable for learning algorithms are not necessarily scientifically intuitive (Swann et al., 2018).

Deep learning methods were reported to learn rich hierarchical models over all kinds of data, and GANs (Goodfellow et al., 2014) are state-of-the-art models to synthesize data. Moreover, deep networks were reported to learn transferable representations (Ren and Lee, 2017). GANs were already exploited with success in cross-domain learning applications for image processing (Zhu et al., 2017; Kim et al., 2017; Janz et al., 2017).

Our goal is to develop a competitive approach to identify stable ternary chemical compounds, i.e., compounds containing three different elements, from observations of binary compounds.

Currently, there is no approach that can be applied directly to this important task of materials science. The state-of-the-art GANs are limited in the sense that they do not generate samples in domains with increased complexity, e.g., the application where we aim to construct crystals with three elements from observations containing two chemical elements only. An attempt to learn many-to-many mappings was recently introduced by Almahairi et al. (2018); however, this promising approach does not make it possible to generate data of a higher-order dimension.

Our contribution is multi-fold:

• To our knowledge, we are the first to introduce a GAN to solve the scientific problem of discovery of novel crystal structures, and we introduce an original methodology to generate new stable chemical compositions;

• The proposed method is called CrystalGAN, and it consists of two cross-domain GAN blocks with constraints integrating prior knowledge including a feature transfer step;

• The proposed model generates data with increased complexity with respect to observed samples;

• We demonstrate by numerical experiments on a real challenge of chemistry and materials science that our approach is competitive compared to existing methods;

• The proposed algorithm is efficiently implemented in Python, and it will be publicly available shortly.

This paper is organized as follows. First, we discuss the related work. Second, we provide the formalization of the problem, and introduce the CrystalGAN. The results of our numerical experiments are shown in the experimental section. Concluding remarks and perspectives close the paper.

Related Work

Our contribution is closely related to the problems of unsupervised learning and cross-domain learning, since our aim is to synthesize novel data, and the new samples are supposed to belong to an unobserved domain with an augmented complexity.

In the adversarial nets framework, the deep generative models compete with an adversary which is a discriminative model learning to identify whether an observation comes from the model distribution or from the data distribution (Goodfellow, 2016). A classical GAN consists of two models, a generator G whose objective is to synthesize data and a discriminator D whose aim is to distinguish between real and generated data. The generator and the discriminator are trained simultaneously, and the training problem is formulated as a two-player minimax game. A number of techniques to improve training of GANs were proposed by Arjovsky, Chintala, and Bottou (2017); Gulrajani et al. (2017); Salimans et al. (2016).

Learning cross-domain relations is an active research direction in image processing. Several recent papers (Zhu et al., 2017; Kim et al., 2017; Almahairi et al., 2018) discuss the idea of capturing some particular characteristics of one image and translating them into another image. This problem is formalized as image-to-image translation, and there exist multiple applications, e.g., converting a grayscale image to a color image, or converting an image from one representation of a given scene to another. The state-of-the-art methods of Zhu et al. (2017); Kim et al. (2017) are based on the property that the translation has to be cycle consistent. If a translator $G : A \to B$ is used, then there exists another translator $F : B \to A$ so that $G$ and $F$ are inverses of each other, and the mappings are bijective. The mappings $G$ and $F$ are trained simultaneously under the cycle consistency assumption, which encourages $F(G(x)) \approx x$ and $G(F(x')) \approx x'$. The objective function includes the adversarial losses on domains A and B, and the cycle consistency loss.

A conditional GAN for image-to-image translation is considered by Isola et al. (2017). An advantage of the conditional model is that it makes it possible to integrate underlying structure into the model. Conditional GANs were also used for multi-modal tasks (Mirza and Osindero, 2014). An idea to combine observed data to produce new data was proposed in (Yazdani, 2017), e.g., an artist can mix existing pieces of music to create a new one.

An approach to learn high-level semantic features, and to train a model for more than a single task, was introduced by Ren and Lee (2017). In particular, it was proposed to train a model to jointly learn several complementary tasks. This method is expected to overcome the problem of overfitting to a single task. An idea to introduce multiple discriminators whose role varies from formidable adversary to forgiving teacher was discussed by Durugkar, Gemp, and Mahadevan (2017).

Several GANs were adapted to materials science and chemical applications. Objective-Reinforced GANs, which perform molecular generation of carbon-chain sequences taking into consideration some desired properties, were introduced in (Sanchez-Lengeling et al., 2017), and the method was shown to be efficient for drug discovery. Another avenue is to integrate rule-based knowledge, e.g., molecular descriptors, with deep learning. ChemNet (Goh et al., 2017) is a deep neural network pre-trained with chemistry-relevant representations obtained from prior knowledge. The model can be used to predict new chemical properties. However, as already mentioned, none of these methods generates crystal data of augmented complexity.

CrystalGAN: an Approach to Generate Stable Ternary Chemical Compounds

In this section, we introduce our approach. The CrystalGAN consists of three procedures:

1. First step GAN, which is closely related to the cross-domain GANs, and which generates pseudo-binary samples where the domains are mixed.
2. Feature transfer procedure, which constructs higher-order complexity data from the samples generated at the previous step, and where components from all domains are well-separated.

3. Second step GAN, which synthesizes, under geometric constraints, novel ternary stable chemical structures.

First, we describe a cross-domain GAN, and then we provide all the details on the proposed CrystalGAN. We provide all notations used by the CrystalGAN in Table 1. The GAN architectures for the first and the second steps are shown in Figure 1.

{| class="wikitable"
|+ Table 1: Notations used in CrystalGAN.
! Notation !! Description
|-
| $AH$ || First domain, H is hydrogen, and A is a metal
|-
| $BH$ || Second domain, H is hydrogen, and B is another metal
|-
| $G_{AHB_1}$ || Generator function that translates input features $x_{AH}$ from (domain) AH to BH
|-
| $G_{BHA_1}$ || Generator function that translates input features $y_{BH}$ from (domain) BH to AH
|-
| $D_{AH}$ and $D_{BH}$ || Discriminator functions of AH domain and BH domain, respectively
|-
| $AHB_1$ || $x_{AHB_1}$ is a sample generated by generator function $G_{AHB_1}$
|-
| $BHA_1$ || $y_{BHA_1}$ is a sample produced by generator function $G_{BHA_1}$
|-
| $AHBA_1$ and $BHAB_1$ || Data reconstructed after two generator translations
|-
| $AHB_g$ and $BHA_g$ || Data obtained after the feature transfer step from domain AH to domain BH, and from domain BH to domain AH, respectively; input data for the second step of CrystalGAN
|-
| $G_{AHB_2}$ || Generator function that translates $x_{AHB_g}$ (features generated in the first step) from $AHB_g$ to $AHB_2$
|-
| $G_{BHA_2}$ || Generator function that translates $y_{BHA_g}$ (data generated in the first step) from $BHA_g$ to $BHA_2$
|-
| $D_{AHB}$ and $D_{BHA}$ || The discriminator functions of domain $AHB_g$ and domain $BHA_g$, respectively
|-
| $AHB_2$ || $x_{AHB_2}$ is a sample generated by the generator function $G_{AHB_2}$
|-
| $BHA_2$ || $y_{BHA_2}$ is a sample produced by the generator function $G_{BHA_2}$
|-
| $AHBA_2$ and $BHAB_2$ || Data reconstructed as a result of two generator translations
|-
| $AHB_2$ and $BHA_2$ || Final new data (to be explored by human experts)
|}

A Cross-Domain GAN: Problem Formulation

DiscoGAN (Kim et al., 2017) and CycleGAN (Zhu et al., 2017) propose a promising modification compared to the classic GAN: the model does not take noise as input but samples from another domain, resulting in cross-domain learning.

We consider a function $G_{ABZ}$ that maps elements from domains A and B to domain Z which includes the co-domains A and B. In an unsupervised learning scenario, $G_{ABZ}$ can be arbitrarily defined; however, to apply it to real-world applications, some conditions on the relation of interest have to be well-defined.

In an idealistic setting, the equality

$$G_{ABZ} \circ G_{ZAB}(x_A, x_B) = (x_A, x_B) \qquad (1)$$

is satisfied. However, this constraint is a hard constraint, it is not straightforward to optimize, and a relaxed soft constraint is preferred. As a soft constraint, we can consider the distance

$$d\big(G_{ABZ} \circ G_{ZAB}(x_A, x_B),\ (x_A, x_B)\big), \qquad (2)$$

and minimize it using a metric function such as $L_1$ or $L_2$. The corresponding generative adversarial loss is

$$-\mathbb{E}_{x_A, x_B \sim P_{A,B}}\big[\log D_Z\big(G_{ABZ}(x_A, x_B)\big)\big]. \qquad (3)$$

The cross-domain GANs were shown to be efficient at discovering relations between two different domains from unpaired samples, without any explicit labels, and at finding a mapping from one domain to another. However, neither DiscoGAN nor CycleGAN is able to generate data with increased complexity.

Problem Formulation for Applications with Augmented Complexity

We now propose a novel architecture based on the cross-domain GAN algorithms with constraint learning to discover higher-order complexity crystallographic systems. We introduce a GAN model to find relations between different crystallographic domains, and to generate new materials.

To make the paper easier to follow, without loss of generality, we present our method on the specific example of generating ternary hydride compounds of the form "A (a metal) - H (hydrogen) - B (a metal)".

The training algorithm observes stable binary compounds containing chemical elements A+H, which is a composition of some metal A and hydrogen H, and B+H, which is a mixture of another metal B with hydrogen. So, a machine learning algorithm has access to observations $\{x_{AH_i}\}_{i=1}^{N_{AH}}$ and $\{y_{BH_i}\}_{i=1}^{N_{BH}}$. Our goal is to generate novel ternary, i.e. more complex, stable data $x_{AHB}$ (or $y_{BHA}$) based on the properties learned from the observed binary structures.

We describe the architecture of the CrystalGAN in Figure 1.

Steps of CrystalGAN

Our approach consists of two consecutive steps with a feature transfer procedure in between.

First Step. The first step of CrystalGAN generates new data with increased complexity. The adversarial network takes $\{x_{AH_i}\}_{i=1}^{N_{AH}}$ and $\{y_{BH_i}\}_{i=1}^{N_{BH}}$, and synthesizes

$$x_{AHB_1} = G_{AHB_1}(x_{AH}), \qquad (4)$$

$$x_{AHBA_1} = G_{BHA_1}(x_{AHB_1}) = G_{BHA_1} \circ G_{AHB_1}(x_{AH}), \qquad (5)$$

and

$$y_{BHA_1} = G_{BHA_1}(y_{BH}), \qquad (6)$$

$$y_{BHAB_1} = G_{AHB_1}(y_{BHA_1}) = G_{AHB_1} \circ G_{BHA_1}(y_{BH}). \qquad (7)$$

Figure 1a summarizes the first step of CrystalGAN.
Figure 1: The CrystalGAN architecture. (a) First step of CrystalGAN. (b) Second step of CrystalGAN.

Figure 2: Encoding of $x_{AH}$ and $y_{BH}$ with placeholders.

The reconstruction loss functions take the following form:

$$L_{R_{AH}} = d(x_{AHBA_1}, x_{AH}) = d\big(G_{BHA_1} \circ G_{AHB_1}(x_{AH}),\ x_{AH}\big), \qquad (8)$$

$$L_{R_{BH}} = d(y_{BHAB_1}, y_{BH}) = d\big(G_{AHB_1} \circ G_{BHA_1}(y_{BH}),\ y_{BH}\big). \qquad (9)$$

Ideally, $L_{R_{AH}} = 0$, $L_{R_{BH}} = 0$, $x_{AHBA_1} = x_{AH}$, and $y_{BHAB_1} = y_{BH}$, and we minimize the distances $d(x_{AHBA_1}, x_{AH})$ and $d(y_{BHAB_1}, y_{BH})$.

The generative adversarial loss functions of the first step of CrystalGAN aim to ensure that the original observations are reconstructed as accurately as possible:

$$L_{GAN_{BH}} = -\mathbb{E}_{x_{AH} \sim P_{AH}}\big[\log\big(D_{BH}(G_{AHB_1}(x_{AH}))\big)\big], \qquad (10)$$

and

$$L_{GAN_{AH}} = -\mathbb{E}_{y_{BH} \sim P_{BH}}\big[\log\big(D_{AH}(G_{BHA_1}(y_{BH}))\big)\big]. \qquad (11)$$

The generative loss functions contain the two terms defined above:

$$L_{G_{AHB_1}} = L_{GAN_{BH}} + L_{R_{AH}}, \qquad (12)$$

$$L_{G_{BHA_1}} = L_{GAN_{AH}} + L_{R_{BH}}. \qquad (13)$$

The discriminative loss functions aim to discriminate the samples coming from AH and BH:

$$L_{D_{BH}} = -\mathbb{E}_{y_{BH} \sim P_{BH}}\big[\log\big(D_{BH}(y_{BH})\big)\big] - \mathbb{E}_{x_{AH} \sim P_{AH}}\big[\log\big(1 - D_{BH}(G_{AHB_1}(x_{AH}))\big)\big], \qquad (14)$$

$$L_{D_{AH}} = -\mathbb{E}_{x_{AH} \sim P_{AH}}\big[\log\big(D_{AH}(x_{AH})\big)\big] - \mathbb{E}_{y_{BH} \sim P_{BH}}\big[\log\big(1 - D_{AH}(G_{BHA_1}(y_{BH}))\big)\big]. \qquad (15)$$

Now, we have all elements to define the full generative loss function of the first step:

$$L_{G_1} = L_{G_{AHB_1}} + L_{G_{BHA_1}} = \lambda_1 L_{GAN_{BH}} + \lambda_2 L_{R_{AH}} + \lambda_3 L_{GAN_{AH}} + \lambda_4 L_{R_{BH}}, \qquad (16)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are real-valued hyper-parameters that control the ratio between the corresponding terms, and the hyper-parameters are to be fixed by cross-validation.

The full discriminator loss function of this step, $L_{D_1}$, is defined as follows:

$$L_{D_1} = L_{D_{AH}} + L_{D_{BH}}. \qquad (17)$$
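The first-step losses of Eqs. (8)-(17) can be illustrated with a minimal pure-Python sketch. It is not the authors' TensorFlow code: the generators and discriminators are passed in as plain callables, the distance d is assumed to be a mean L1 distance, expectations are replaced by batch averages, and the toy maps `double`, `halve`, and `coin_flip` are hypothetical stand-ins chosen only so that the numbers are easy to check by hand.

```python
import math


def l1_distance(u, v):
    """Mean absolute difference between two equally long feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def mean(values):
    """Average of an iterable of numbers (batch estimate of an expectation)."""
    values = list(values)
    return sum(values) / len(values)


def first_step_losses(x_ah, y_bh, g_ahb1, g_bha1, d_ah, d_bh,
                      lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Batch estimates of the first-step losses, following Eqs. (4)-(17)."""
    x_ahb1 = [g_ahb1(x) for x in x_ah]       # Eq. (4)
    x_ahba1 = [g_bha1(x) for x in x_ahb1]    # Eq. (5)
    y_bha1 = [g_bha1(y) for y in y_bh]       # Eq. (6)
    y_bhab1 = [g_ahb1(y) for y in y_bha1]    # Eq. (7)

    # Reconstruction losses, Eqs. (8)-(9); d is assumed to be a mean L1 distance.
    l_r_ah = mean(l1_distance(r, x) for r, x in zip(x_ahba1, x_ah))
    l_r_bh = mean(l1_distance(r, y) for r, y in zip(y_bhab1, y_bh))

    # Generative adversarial losses, Eqs. (10)-(11).
    l_gan_bh = mean(-math.log(d_bh(x)) for x in x_ahb1)
    l_gan_ah = mean(-math.log(d_ah(y)) for y in y_bha1)

    # Discriminator losses, Eqs. (14)-(15).
    l_d_bh = (mean(-math.log(d_bh(y)) for y in y_bh)
              + mean(-math.log(1.0 - d_bh(x)) for x in x_ahb1))
    l_d_ah = (mean(-math.log(d_ah(x)) for x in x_ah)
              + mean(-math.log(1.0 - d_ah(y)) for y in y_bha1))

    lam1, lam2, lam3, lam4 = lambdas
    return {
        "L_R_AH": l_r_ah,
        "L_R_BH": l_r_bh,
        # Weighted generator loss, Eq. (16).
        "L_G1": lam1 * l_gan_bh + lam2 * l_r_ah + lam3 * l_gan_ah + lam4 * l_r_bh,
        # Full discriminator loss, Eq. (17).
        "L_D1": l_d_ah + l_d_bh,
    }


# Hypothetical toy stand-ins: two mutually inverse maps and an undecided discriminator.
double = lambda v: [2.0 * a for a in v]
halve = lambda v: [a / 2.0 for a in v]
coin_flip = lambda v: 0.5

losses = first_step_losses(
    x_ah=[[1.0, 2.0], [3.0, 4.0]],
    y_bh=[[0.5, 1.5], [2.5, 3.5]],
    g_ahb1=double, g_bha1=halve, d_ah=coin_flip, d_bh=coin_flip,
)
```

Because the two toy generators are exact inverses, both reconstruction losses vanish, and with a discriminator that always outputs 0.5 the generator loss reduces to 2 log 2 and the discriminator loss to 4 log 2.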
Feature Transfer. The first step generates pseudo-binary samples $MH$, where $M$ is a newly discovered domain merging A and B properties. Although these results can be interesting for human experts, the samples generated by the first step are not easy to interpret, since the domains A and B are completely mixed in these samples, and there is no way to deduce the characteristics of the two separate elements coming from these domains.

So, we need a second step which will generate data of a higher-order complexity from two given domains. We transfer the attributes of A and B elements (this procedure is also shown in Figure 1a) in order to construct a new dataset that will be used as a training set in the second step of the CrystalGAN.

In order to prepare the datasets to generate higher-order complexity samples, we add a placeholder. (E.g., for domain AH, the fourth matrix is empty, and for domain BH, the third matrix is empty.) This implementation detail is sketched in Figure 2.

Second Step of the CrystalGAN. The second step GAN takes as input the data generated by the first step GAN and modified by the feature transfer procedure. The results of the second step are samples which describe ternary chemical compounds that are supposed to be stable from a chemical viewpoint. The geometric constraints control the quality of generated data.

A crystallographic structure is fully described by a local distribution. This distribution is determined by the distances to all nearest neighbors of each atom in a given crystallographic structure. We augment the second step GAN with the following geometric constraints, which express the geometric conditions of our scientific domain application. The implemented constraints are also shown in Figure 1b.

Let $S = \{s_i\}_{i=1}^{m}$ be the set of distances of the first neighbors of all atoms in a crystallographic structure. There are two geometric constraints to be considered while generating new data.

The first geometric (geo) constraint is defined as follows:

$$L_{geo_1} = f(d_1, s_1, \ldots, s_m) = \min_{s \in S} \| d_1 - s \|_2^2, \qquad (18)$$

where $d_1$ is the minimal distance between two first nearest neighbors in a given crystallographic structure.

The second geometric constraint takes the following form:

$$L_{geo_2} = f(d_2, s_1, \ldots, s_m) = -\min_{s \in S} \| d_2 - s \|_2^2, \qquad (19)$$

where $d_2$ is the maximal distance between two first nearest neighbors.

The loss function of the second step GAN is augmented by the following geometric constraints:

$$L_{geo} = L_{geo_1} + L_{geo_2}. \qquad (20)$$

Given $x_{AHB_g}$ and $y_{BHA_g}$ from the previous step, we generate:

$$x_{AHB_2} = G_{AHB_2}(x_{AHB_g}), \qquad (21)$$

$$x_{AHBA_2} = G_{BHA_2}(x_{AHB_2}) = G_{BHA_2} \circ G_{AHB_2}(x_{AHB_g}), \qquad (22)$$

and

$$y_{BHA_2} = G_{BHA_2}(y_{BHA_g}), \qquad (23)$$

$$y_{BHAB_2} = G_{AHB_2}(y_{BHA_2}) = G_{AHB_2} \circ G_{BHA_2}(y_{BHA_g}). \qquad (24)$$

The reconstruction loss functions are given by:

$$L_{R_{AHB}} = d(x_{AHBA_2}, x_{AHB_g}) = d\big(G_{BHA_2} \circ G_{AHB_2}(x_{AHB_g}),\ x_{AHB_g}\big), \qquad (25)$$

$$L_{R_{BHA}} = d(y_{BHAB_2}, y_{BHA_g}) = d\big(G_{AHB_2} \circ G_{BHA_2}(y_{BHA_g}),\ y_{BHA_g}\big). \qquad (26)$$

The generative adversarial loss functions are given by:

$$L_{GAN_{BHA_g}} = -\mathbb{E}_{x_{AHB_g} \sim P_{AHB_g}}\big[\log\big(D_{BHA}(G_{AHB_2}(x_{AHB_g}))\big)\big], \qquad (27)$$

$$L_{GAN_{AHB_g}} = -\mathbb{E}_{y_{BHA_g} \sim P_{BHA_g}}\big[\log\big(D_{AHB}(G_{BHA_2}(y_{BHA_g}))\big)\big]. \qquad (28)$$

The generative loss functions of this step are defined as follows:

$$L_{G_{AHB_2}} = L_{GAN_{BHA_g}} + L_{R_{AHB}}, \qquad (29)$$

$$L_{G_{BHA_2}} = L_{GAN_{AHB_g}} + L_{R_{BHA}}. \qquad (30)$$

The losses of the discriminators of the second step are defined as:

$$L_{D_{BHA}} = -\mathbb{E}_{y_{BHA_g} \sim P_{BHA_g}}\big[\log\big(D_{BHA}(y_{BHA_g})\big)\big] - \mathbb{E}_{x_{AHB_g} \sim P_{AHB_g}}\big[\log\big(1 - D_{BHA}(G_{AHB_2}(x_{AHB_g}))\big)\big], \qquad (31)$$

$$L_{D_{AHB}} = -\mathbb{E}_{x_{AHB_g} \sim P_{AHB_g}}\big[\log\big(D_{AHB}(x_{AHB_g})\big)\big] - \mathbb{E}_{y_{BHA_g} \sim P_{BHA_g}}\big[\log\big(1 - D_{AHB}(G_{BHA_2}(y_{BHA_g}))\big)\big]. \qquad (32)$$

Now, we have all elements to define the full generative loss function:

$$L_{G_2} = L_{G_{AHB_2}} + L_{G_{BHA_2}} + L_{geo} = \lambda_1 L_{GAN_{BHA_g}} + \lambda_2 L_{R_{AHB}} + \lambda_3 L_{GAN_{AHB_g}} + \lambda_4 L_{R_{BHA}} + \lambda_5 L_{geo_1} + \lambda_6 L_{geo_2}, \qquad (33)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, and $\lambda_6$ are real-valued hyper-parameters that control the influence of the terms.

The full discriminative loss function of the second step, $L_{D_2}$, takes the form:

$$L_{D_2} = L_{D_{AHB}} + L_{D_{BHA}}. \qquad (34)$$

To summarize, in the second step, we use the dataset issued from the feature transfer as an input containing two domains $x_{AHB_g}$ and $y_{BHA_g}$. We train the cross-domain GAN taking into consideration constraints of the crystallographic environment. We integrated geometric constraints proposed by crystallography and materials science experts to satisfy environmental constraints, and to increase the rate of synthesized stable ternary compounds. The second step of CrystalGAN is sketched in Figure 1b.
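The geometric terms of Eqs. (18)-(20) can be evaluated directly once the first-neighbor distances of a structure are known. The sketch below is only an illustration under stated assumptions: distances are scalars, so the squared norm reduces to a squared difference; the signs follow the equations exactly as printed; the default thresholds 1.8 Å and 3 Å are the values reported in the implementation details; and the function name `geometric_losses` and the unit default weights are hypothetical.

```python
def geometric_losses(first_neighbor_distances, d1=1.8, d2=3.0, lam5=1.0, lam6=1.0):
    """Evaluate the geometric terms of Eqs. (18)-(20) for one structure.

    first_neighbor_distances: the set S of first-neighbor distances, in angstrom.
    d1, d2: reference minimal and maximal distances, in angstrom.
    lam5, lam6: weights of the two terms in the full generator loss, Eq. (33).
    """
    s = list(first_neighbor_distances)
    if not s:
        raise ValueError("at least one neighbor distance is required")
    l_geo1 = min((d1 - si) ** 2 for si in s)     # Eq. (18)
    l_geo2 = -min((d2 - si) ** 2 for si in s)    # Eq. (19), sign as printed
    return {
        "L_geo1": l_geo1,
        "L_geo2": l_geo2,
        "L_geo": l_geo1 + l_geo2,                    # Eq. (20)
        "weighted": lam5 * l_geo1 + lam6 * l_geo2,   # contribution to Eq. (33)
    }


# Toy set of first-neighbor distances (hypothetical values, in angstrom).
terms = geometric_losses([2.0, 2.5, 3.0])
```

For this toy input the closest distance to d1 is 2.0 Å, so the first term is about 0.04, and one distance coincides with d2, so the second term is zero.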
GAN Architecture

A generator network is defined as $G_{AHB_1} : \mathbb{R}^{l \times m}_{AH}, \mathbb{R}^{l \times m}_{BH} \to \mathbb{R}^{k \times m}_{AHB_1}$, where AH and BH are the input domains, $AHB_1$ is the output domain, $l$ and $m$ are the dimensions of the input, and $k$ and $m$ are the dimensions of the output samples.

The discriminator network is denoted as $D_{AH} : \mathbb{R}^{k \times m}_{AHB_1} \to [0, 1]$, and it discriminates samples in domain $AHB_1$. Each generator takes an observation of size $l \times m$, and passes it to the encoder-decoder pair. Note that $G_{BHA_1}$, $G_{AHB_2}$, $G_{BHA_2}$, $D_{BH}$, $D_{AHB}$ and $D_{BHA}$ are similarly defined. The encoder and the decoder are composed of fully-connected layers. The number of layers ranges from 5 to 10 depending on the domain. The discriminator has an additional layer, a sigmoid function, to output a predicted label.

Experiments

Task Description: Exploring Novel Hydrides

Hydrides, compounds which associate hydrogen atoms with other chemical elements, are actively used in storage battery technologies such as the nickel-metal hydride battery. A number of hydrides have been explored as a means of hydrogen storage for fuel-cell powered electric cars.

Crystallographic structures can be represented using POSCAR files, which are input files for DFT calculations under the VASP code (Kresse and Joubert, 1999). These are coordinate files: they contain the lattice geometry and the atomic positions, as well as the number (or the composition) and the nature of atoms in the crystal unit cell.

We use a dataset constructed from (Bourgeois et al., 2017; Villars and Cenzual, 2017) by experts in materials science. Our training data set contains POSCAR files, and the proposed CrystalGAN also generates POSCAR files. Such a file contains three matrices: the first one is the abc matrix, corresponding to the three lattice vectors defining the unit cell of the system, the second matrix contains the atomic positions of the H atoms, and the third matrix contains the coordinates of the metallic atoms A (or B). The information from the files is fed into 4-dimensional tensors. An example of a POSCAR file and its corresponding representation for the GANs is shown in Figure 3. In Figure 4 we show the corresponding structure in 3D. Note that we increase the data complexity by the feature transfer procedure by adding placeholders.

Figure 3: An example of a POSCAR file describing the composition of Palladium and Hydrogen, and the data representation in the CrystalGAN.

Our training dataset includes 1,416 POSCAR files of binary hydrides divided into 63 classes, where each class is represented as a 4-dimensional tensor. Each class of binary $MH$ hydride contains two elements: hydrogen H and another element M from the periodic table. The latter is selected from the 63 M elements highlighted (in yellow) in Figure 7. In our experiments, after discussions with materials science researchers, we focused on the exploration of ternary compositions "Palladium - Hydrogen - Nickel" from observations of the binary systems "Palladium - Hydrogen" and "Nickel - Hydrogen". So, AH = PdH, and BH = NiH. We also considered another task, to generate ternary compounds "Magnesium - Hydrogen - Titanium".

From each system (domain), we have selected 35 crystal structures (stable and metastable) which include experimentally observed prototypes. Here is a brief data description for this task:

{| class="wikitable"
! Input dataset !! Dimension
|-
| PdH || [35, 4, 18, 3]
|-
| NiH || [35, 4, 18, 3]
|}

where 18 and 3 are the maximal numbers of lines and columns in each matrix, respectively.

In the CrystalGAN, we need to compute all the distances of the nearest neighbors for each generated POSCAR file. The distances between hydrogen atoms H in a given crystallographic structure should respect some geometric rules, as well as the distances between the atoms $A - B$, $A - A'$, and $B - B'$. We applied the geometric constraints introduced in the previous section on the distances between the neighbors (for each atom in a crystallographic structure). Note that the distances $A - H$ and $B - H$ are not penalized by the constraints.

Implementation Details

In order to compute the distances between all nearest neighbors in the generated data, we used the Python library Pymatgen (Ong et al., 2012), specifically developed for materials analysis.

For all experiments in this paper, the distances are fixed by our colleagues in crystallography and materials science to $d_1 = 1.8$ Å (angstrom, $10^{-10}$ meter) and $d_2 = 3$ Å. We set all the hyper-parameters by cross-validation; however, we found that a reasonable performance is reached when all $\lambda_i$ have similar values, and are quite close to 1. We use the standard AdamOptimizer with learning rate $\alpha = 0.0001$, and $\beta_1 = 0.5$. The number of epochs is set to 1000 (we verified that the functions converge). The mini-batch size equals 35.

Each block of the CrystalGAN architecture (the generators and the discriminators) is a multi-layer neural network with 5 hidden layers. Each layer contains 100 units. We use the rectified linear unit (ReLU) as the activation function of the neural network. All these parameters were fixed by cross-validation (for both chosen domains "Palladium - Hydrogen" and "Nickel - Hydrogen").

Our code is implemented in Python (TensorFlow). We run the experiments on a GPU, an NVIDIA Quadro M5000 graphics card.
{| class="wikitable"
|+ Table 2: Number of ternary compositions of good quality generated by the tested methods.
! Composition !! GAN (standard) !! DiscoGAN !! CrystalGAN without constraints !! CrystalGAN with geometric constraints
|-
| Pd - Ni - H || 0 || 0 || 4 || 9
|-
| Mg - Ti - H || 0 || 0 || 2 || 8
|}
Figure 4: A visualization of a stable structure.

Results

In our numerical experiments, we compare the proposed CrystalGAN with a classical GAN, the DiscoGAN (Kim et al., 2017), and the CrystalGAN without the geometric constraints. All these GANs generate POSCAR files, and we evaluate the performance of the models by the number of generated ternary structures which satisfy the geometric crystallographic environment. Table 2 shows the number of successes for the considered methods. The classical GAN, which takes Gaussian noise as an input, does not generate acceptable chemical structures. The DiscoGAN approach performs quite well if we use it to generate novel pseudo-binary structures; however, it is not adapted to synthesize ternary compositions. We observed that the CrystalGAN (with the geometric constraints) outperforms all tested methods.

Figure 5 illustrates characteristics of a newly generated ternary (H-Pd-Ni) stable structure: on the left we show the distances between the nearest neighbours in the crystallographic structure, and on the right we visualise the generated POSCAR file. We would like to underline that the generated structure respects the geometric constraints.

Discussion

Here we provide some important remarks on the task considered in this contribution. Discovery of stable chemical structures in general, and of new materials for hydrogen storage in particular, is a challenging task.

First, from multiple discussions with experts in materials science and chemistry, we know that the number of novel stable compounds cannot be very high, and it is already considered a success if we synthesize several stable structures which satisfy the constraints. Hence, we cannot really reason in terms of accuracy or error rate, which are widely used metrics in machine learning and data mining.

Second, evaluation of a stable structure is not straightforward. Given a new composition, only the result of density functional theory (DFT) calculations can provide a conclusion on whether this composition is stable enough, and whether it can be used in practice. However, the DFT calculations are computationally too expensive, and it is out of the question to run them on all the data we generated using the CrystalGAN. In our work, to avoid the DFT computations, we employ geometric constraints proposed by human experts to control the properties of the generated compounds, such as the Switendick criterion (Switendick, 1979). It is planned to run the DFT calculations on some pre-selected generated ternary compositions to take a final decision on the practical utility of the chemical compounds.

The evaluation of generated crystallographic structures can also be done by laboratory experiments, exploring geometric properties of the compositions based on the distances between atoms. For example, Figure 4 illustrates a stable structure in the cubic NaCl prototype. Another representation of synthesized data is a histogram of the number of nearest neighbors at a given distance, which forms a pair distribution function (PDF). Figure 6 shows a PDF profile for a stable structure where the minimal distance between atoms is $d_{min}(A, H) = 2$ Å (angstrom) for the 6 first nearest neighbours (the cubic cell parameter is 4 Å in this example).

Conclusions

Our goal was to develop a principled approach to generate new ternary stable crystallographic structures from observed binary structures, i.e. structures containing two chemical elements only. We propose a learning method called CrystalGAN to discover cross-domain relations in real data, and to generate novel structures. The proposed approach can efficiently integrate, in the form of constraints, prior knowledge provided by human experts.

CrystalGAN is the first GAN developed to generate scientific data in the field of materials science. To our knowledge, it is also the first approach which generates data of a higher-order complexity, i.e., ternary structures where the domains are well-separated, from observed binary compounds. The CrystalGAN was, in particular, successfully tested on the challenge of discovering new materials for hydrogen storage.

Currently, we investigate different GAN architectures, also including elements of reinforcement learning, to produce data of even higher complexity, e.g., compounds containing four or five chemical elements. Note that although the CrystalGAN was developed and tested for applications in materials science, it is a general method where the constraints can be easily adapted to any scientific problem.
Figure 5: The list of the nearest neighbours (on the left); the corresponding generated POSCAR file (on the right).
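To make the nearest-neighbour analysis concrete, the following is a minimal sketch of how the number of neighbours at each distance could be counted for every atom of a periodic structure. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `neighbor_distance_counts`, the `cutoff` and `decimals` parameters, and the toy simple cubic cell in the usage note are all hypothetical. The inputs mirror what a POSCAR file provides, namely three lattice vectors and fractional atomic coordinates.

```python
import itertools
from collections import Counter

import numpy as np


def neighbor_distance_counts(lattice, frac_coords, cutoff=6.0, decimals=4):
    """Count, for each atom, the periodic neighbours found at each distance.

    lattice     : (3, 3) array whose rows are the lattice vectors (angstroms)
    frac_coords : (n, 3) array of fractional atomic coordinates
    cutoff      : largest distance to consider (hypothetical default)
    Returns one Counter {rounded distance: number of neighbours} per atom.
    """
    lattice = np.asarray(lattice, dtype=float)
    # Wrap coordinates into [0, 1) so the image count below is sufficient.
    frac = np.asarray(frac_coords, dtype=float) % 1.0

    # Spacing between opposite cell faces: h_i = V / |a_j x a_k|.
    volume = abs(np.linalg.det(lattice))
    cross = np.cross(lattice[[1, 2, 0]], lattice[[2, 0, 1]])
    heights = volume / np.linalg.norm(cross, axis=1)

    # Periodic images needed along each axis to cover the cutoff sphere.
    reps = np.ceil(cutoff / heights).astype(int)
    shifts = np.array(
        list(itertools.product(*(range(-r, r + 1) for r in reps)))
    )

    counts = []
    for i in range(len(frac)):
        # Vectors from atom i to every image of every atom, in Cartesian form.
        delta = (frac[None, :, :] - frac[i] + shifts[:, None, :]).reshape(-1, 3)
        dist = np.linalg.norm(delta @ lattice, axis=1)
        # Discard the atom itself and everything beyond the cutoff.
        dist = dist[(dist > 1e-8) & (dist <= cutoff + 1e-8)]
        counts.append(Counter(np.round(dist, decimals).tolist()))
    return counts


if __name__ == "__main__":
    # Toy example (an assumption): one atom in a simple cubic cell, a = 2.0.
    toy = neighbor_distance_counts(2.0 * np.eye(3), [[0.0, 0.0, 0.0]])
    for distance, n in sorted(toy[0].items()):
        print(distance, n)
```

For the toy simple cubic cell with edge 2.0, the sketch reports 6 neighbours at distance 2.0, 12 at 2.8284, 8 at 3.4641, and so on up to 30 at 6.0. A generated structure could then be screened by comparing the smallest distance of each atom against expert-provided thresholds, which is where the geometric constraints would enter.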
Figure 7: The elements included in our data set are highlighted.

Acknowledgements
This work was supported by the French National Research
Agency (ANR JCJC DiagnoLearn).

[Figure 6 plot: number of nearest neighbors (axis ticks 0 to 30) against nearest neighbors distances (ticks 2.0, 2.8284, 3.4641, 4.0, 4.4721, 4.899, 5.6569, 6.0); points labelled A and H.]
Figure 6: Number of nearest neighbors at a given distance for each atom in a structure.