=Paper=
{{Paper
|id=Vol-2022/paper57
|storemode=property
|title=
Machine Learning Methods Application to Search for Regularities in Chemical Data
|pdfUrl=https://ceur-ws.org/Vol-2022/paper57.pdf
|volume=Vol-2022
|authors=Nadezhda N. Kiselyova,Andrey V. Stolyarenko,Victor A. Dudarev
|dblpUrl=https://dblp.org/rec/conf/rcdl/KiselyovaSD17
}}
== Machine Learning Methods Application to Search for Regularities in Chemical Data ==
© N.N. Kiselyova¹, © A.V. Stolyarenko¹, © V.A. Dudarev¹,²
¹ Institution of Russian Academy of Sciences A.A. Baikov Institute of Metallurgy and Materials Science RAS (IMET RAS), Moscow
² National Research University Higher School of Economics (NRU HSE), Moscow, Russia
kis@imet.ac.ru, stol-drew@yandex.ru, vic@imet.ac.ru
Abstract. The possibility of searching for classification regularities in large arrays of chemical information by means of machine learning methods is discussed. The peculiarities of tasks in inorganic chemistry and materials science are considered. A short review of applications of these methods to inorganic chemistry and materials science is presented. A system for computer-assisted design of inorganic compounds based on machine learning methods has been developed. The system makes it possible to predict new inorganic compounds and to estimate some of their properties without experimental synthesis. The results of applying this information-analytical system to the design of inorganic compounds are promising for the search for new materials.

Keywords: machine learning, database, inorganic chemistry, design of inorganic compounds.
Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10-13, 2017

1 Introduction

Throughout the centuries of their evolution, chemistry and materials science have accumulated a huge amount of information. In common with other experimental sciences, chemistry has passed through several stages: information accumulation, data analysis, and the development of classification schemes and rules that allow a new object to be assigned to a particular class of substances. The division of substances into inorganic and organic ones, the Periodic Table of elements, the classification of compounds according to crystal structure type, etc. are examples of such classifications. Essentially, in all cases these classifications are imprecise, and the classes partially intersect. For example, organic chemistry is defined as the chemistry of carbon compounds, but carbides and carbonates belong to inorganic chemistry, as do boron hydrides (boranes) and silicon hydrides (silanes), which are closer to hydrocarbons (objects of organic chemistry) in many properties. To a large extent this is caused by imperfections in the classification rules developed by chemists. One way to get around these problems in inorganic chemistry and materials science is to apply machine learning methods to information analysis, with the aim of discovering complicated classifying regularities that allow substances to be assigned to particular classes. It is noteworthy that the regularities obtained include the properties of substance components as variables; for this reason, they allow us to predict the class of a substance that has not yet been synthesized, knowing only the well-known parameter values of the chemical elements that form it.

Half a century ago IMET pioneered such an approach, using machine learning to search for classifying regularities that allow the prediction of new inorganic compounds and the estimation of some of their properties [1, 2]. The application of machine learning methods [3, 4] allowed new binary compounds to be predicted with 90% reliability knowing only the properties of the constituent chemical elements. The success of the approach put forward at IMET gave an impetus to many investigations of machine learning applications in inorganic chemistry and materials science carried out in various countries. The geography of investigations in this field is very wide: Europe, America, Asia, Africa (figure 1). The most representative teams work in Russia, the USA, and China. More detailed reviews of this research are given in the monograph [5] and the reviews [6, 7]. It should be noted that in recent years governmental initiatives aimed at applying IT (as well as machine learning methods) to chemistry and materials science have been announced in developed countries: the Materials Genome Initiative (USA) [8], the Materials Research by Information Integration Initiative (Japan) [9], and the Chinese Materials Genome (China) [10]. The use of theoretical methods is expected to bring substantial progress in chemistry and materials science and to reduce costs in the research, development, and production of new materials.

2 Problem Statement and Decision Methods

Suppose that every inorganic substance is described by a vector x = (x_1^(1), x_2^(1), …, x_M^(1), x_1^(2), x_2^(2), …, x_M^(2), …, x_1^(L), x_2^(L), …, x_M^(L)), where L is the number of chemical elements that form a compound and M is the number of chemical element parameters. Each substance is also characterized by a class membership parameter a(x) ∈ {1, 2, …, K}, where K is the number of classes.
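As a minimal sketch of this attribute description, the Python fragment below builds x by concatenating the M parameters of each of the L constituent elements. The two parameters used (atomic number and Pauling electronegativity), the order of the elements, and the names ELEMENT_PARAMS and describe are assumptions made only for illustration; they are not the parameter set or the code of the system described in this paper.

```python
# Illustrative element parameters (M = 2): atomic number and Pauling
# electronegativity. The choice of parameters is an assumption for this sketch.
ELEMENT_PARAMS = {
    "K": (19, 0.82),
    "Sb": (51, 2.05),
    "S": (16, 2.58),
}


def describe(elements):
    """Concatenate the M parameters of each of the L constituent elements."""
    vector = []
    for symbol in elements:
        vector.extend(ELEMENT_PARAMS[symbol])
    return vector


# A hypothetical three-element description (L = 3), listed in A, B, X order.
x = describe(["Sb", "K", "S"])
print(x)
```

The resulting vector has L × M = 6 components; the same block of M parameters is repeated once per element, so the vector grows with the number of elements in the compound.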
The learning sample consists of N objects: S = {x_i, i = 1, …, N}. We denote the subset of learning sample objects from class a_j, j = 1, 2, …, K, by S_{a_j} = {x: a(x) = a_j}. The aim of machine learning is to construct a classification rule that not only distinguishes the learning sample objects of different classes but also retains predictive ability for new combinations of chemical elements that were not used for learning.

Figure 1. Distribution by country of publications on applications of machine learning methods to inorganic chemistry and materials science.

Among the numerous machine learning methods, various modifications of Artificial Neural Network (ANN) learning algorithms and Support Vector Machine (SVM) algorithms are the most popular (figure 2). This is due to the accessibility of appropriate software packages and to the seemingly high accuracy on examination samples (many investigators do not take into account the influence of the overfitting effect, which is inherent in these methods, on the reliability of subsequent predictions).

Figure 2. Popularity of various machine learning methods in inorganic chemistry.

Notation: ANN – artificial neural network learning; SVM – support vector machine; KNN – k-nearest neighbors method; DT – decision trees learning; GPN – concept formation using growing pyramidal networks; LM – linear machine method; LDF – linear Fisher discriminant; LoReg – voting algorithm in which estimates for classes are calculated by voting over a system of logical regularities; SWS – statistical weighted syndromes; DTA – deadlock test algorithm; ECA – estimate calculating algorithm.

A great diversity of chemical and materials science tasks have been solved successfully using machine learning methods, e.g.:
theoretical tasks of predicting:
- the phase diagram type of an inorganic system [5, 11];
- the formation of inorganic compounds with a certain stoichiometric composition [1, 2, 5-7, 12];
- the crystal structure type of inorganic compounds [5-7, 13, 14];
- some properties of inorganic compounds (melting point [15], critical temperature of superconductivity [16], band gap energy [17], enthalpy of formation [18], etc.);
technological tasks of predicting:
- mechanical properties of steels [19];
- acoustic properties of tellurite glasses [20];
- tribological behavior of an aluminum–copper based composite [21];
- functional properties of ceramic materials [22], and so on.

3 Experience in machine learning system development for chemical applications

A special information-analytical system (IAS) that automates the solution of tasks in inorganic chemistry using machine learning was developed at IMET [23]. The peculiarities of the subject field were taken into account in creating the IAS, namely:
1) The composite structure of the attribute description: the set of parameters of chemical elements (inorganic substance components) is repeated as many times as there are elements in the compound.
2) Strong correlation within the set of these attributes for each component, due to their dependence on a common parameter, the atomic number of the chemical element (this follows from the Periodic Law).
3) The properties of individual chemical elements give little information gain; therefore, more informative parameters of simple compounds (for example, simple oxides, halides, chalcogenides, etc.) and algebraic functions of component properties are widely used in addition.
4) Blanks in attribute values, which are filled by various methods, including interpolation that takes into account the periodicity in the variation of chemical element properties with atomic number.
5) Large asymmetry in the learning set sizes of different classes (and often the least represented classes, as a rule newly obtained classes of substances, are the most interesting for chemists).
6) Errors and discrepancies in the experimental classification of the inorganic compounds in the learning set, which decrease the prediction accuracy drastically.

The machine learning procedure involves several stages:
1) selection of objects for machine learning,
2) formation of the attribute description (including selection of the most informative attributes and filling of blanks in attribute values),
3) selection of machine learning algorithms,
4) machine learning, including the application of algorithm ensembles and the synthesis of a collective solution when several algorithms are used,
5) estimation of machine learning quality,
6) prediction of the status of new objects and interpretation of the results.

3.1 Objects selection for machine learning

The formation of a representative and reliable set for machine learning largely determines the accuracy of subsequent predictions. The selection of objects (examples of known inorganic substances) for machine learning is performed by subject domain experts using information stored in databases (DBs) on the properties of inorganic substances and materials, including DBs developed at IMET [17, 23-25]. The latter contain data on tens of thousands of substances and are accessible via the Internet [25]. Data on substances were extracted from thousands of publications. As in other intellectual fields, papers can contain errors and inaccuracies. Experimental errors in object classification contribute significantly to the decrease in prediction accuracy. However, estimating the reliability of the classification of tens of thousands of substances is a massively expensive and practically impossible task. We propose partial automation of the search for data outliers using machine learning. This works best for detecting errors caused by incorrect and incomplete experimental knowledge of the class to which a substance belongs (for example, crystal structure type), as well as errors caused by erroneous property values of the components that form the substance description. In the latter case the errors can result from incorrect experimental measurement of a property value, or they can be associated with incorrect interpolation when blanks in attribute values are filled. Analysis of the machine learning results makes it possible to detect substances that fall within another class and to provide the chemist with information for expert assessment of the substance and for deciding on its status. The principal possibility of solving this problem stems from a specific feature of the subject domain: the periodic variation of the properties of inorganic compounds with the atomic numbers of the elements that are the components of the chemical system.

3.2 Attribute description formation

The formation of the attribute description is a complicated and hard-to-solve problem of modern machine learning theory. There are a large number of approaches that have proved their effectiveness in solving various types of tasks. However, it is impossible to devise a universal method of attribute selection that is certain to be optimal. For this reason we use a few alternative methods for attribute selection, followed by synthesis of a collective decision. In addition, tools for visualizing 2D projections are applied to the points corresponding to compounds of a certain type in the space of chemical element properties. The set of parameters includes not only the initial attributes but also algebraic functions of these attributes selected by the user.

3.3 Machine learning algorithms selection

The IAS includes a set of machine learning algorithms that are the most popular among chemists (figure 2). At present the IAS includes the following software: programs based on the well-known linear machine method, Fisher linear discriminant, k-nearest neighbors, support vector machine, and neural network algorithms, as well as algorithms developed by the Computing Centre of the Russian Academy of Sciences and based on estimate calculation, deadlock test voting algorithms, logical regularity voting algorithms, weighted statistical voting algorithms, etc. [26]. The IAS also includes the ConFor system for machine learning by the concept formation procedure, developed by the Institute of Cybernetics of the National Academy of Sciences of Ukraine [27]. This system is built upon the arrangement of data in computer memory in the form of growing pyramidal networks. For each task at hand, the most accurate machine learning algorithms are selected for subsequent use in the decision making and prediction procedures.

3.4 Machine learning

Our experience in solving prediction tasks in inorganic chemistry shows [6, 7, 12, 17, 23, 24] that applying ensembles of algorithms considerably increases the accuracy of inorganic compound prediction. The decision making process uses the most accurate machine learning algorithms selected at the previous stage. The IAS includes programs implementing various collective decision strategies, which are based on the Bayes method, clustering and selection methods, decision templates, logical correction, the convex stabilizer method, the Woods dynamic method, committee methods, etc. [26].

3.5 Machine learning quality estimation

Cross-validation on the learning set objects is the most widely used universal and reliable tool for estimating machine learning quality. The IAS contains special software for this procedure, which is used in selecting the best machine learning algorithms. However, when cross-validation is used as the optimization criterion for algorithm ensembles, the resulting accuracy estimate is no longer unbiased, and there is a certain risk of overfitting. For this reason, the accuracy of collective algorithms is evaluated by the traditional approach: examination recognition of N examples chosen randomly from the learning sample and not used in learning (at the final prediction stage, these reference examples are returned to the learning set). The corresponding program has been included in the IAS.

Asymmetry in the learning set sizes of different classes is an important problem in estimating machine learning accuracy.
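To illustrate why class-size asymmetry complicates accuracy estimation, the following Python sketch compares the overall examination accuracy with the accuracy obtained for each class separately. The examination set (95 objects of one class and 5 of the other) and the degenerate rule that always predicts the larger class are hypothetical and were chosen only to make the effect visible; the function name per_class_accuracy is likewise an assumption of this sketch.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: fraction of that class's objects recognized correctly}."""
    totals, hits = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (1 if t == p else 0)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical examination set: 95 objects of class 1 and 5 objects of class 2.
true_labels = [1] * 95 + [2] * 5
# A degenerate rule that always answers with the larger class.
predicted = [1] * 100

overall = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
by_class = per_class_accuracy(true_labels, predicted)
print(overall, by_class)
```

In this constructed case the overall accuracy is 0.95 although no object of the small class is recognized, so a single generalized accuracy figure says little about the classes that are of most interest.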
Naturally, in this case the generalized examination recognition accuracy does not represent the prediction error for small classes; therefore, ROC curves are appropriate for analyzing the prediction quality of different algorithms. ROC curves allow the recognition accuracy for the target class and the alternative classes to be compared as the cut-offs that determine class membership are varied.

It should be pointed out that the estimation of machine learning quality remains a largely unsolved machine learning problem. Some algorithms (SVM, ANN, etc.), which are characterized by the overfitting effect, often show high examination recognition accuracy, but this does not always guarantee high reliability of prediction for new objects.

3.6 Prediction of new inorganic compounds formation and some of their properties estimation

To increase prediction accuracy for learning sets with K classes (K > 2), the following method is used. First, multi-class learning and prediction are carried out. Next, K dichotomies are calculated (the target class versus all the alternative classes), followed by K predictions. The results of the multi-class prediction and of the series of dichotomies are compared, and if the predictions are not contradictory, the decision on the object status is made. Special tools for forming a collective decision based on comparing the multi-class prediction results with the series of dichotomies have been developed. The efficiency of this approach, which increases prediction accuracy, has been confirmed in solving numerous tasks [5-7, 12, 17, 23, 24].

4 IAS application illustration to regularities search in chemical information

The application of machine learning made it possible to search for regularities in the formation of inorganic compounds, to predict thousands of not yet synthesized substances, and to estimate some of their properties using the regularities obtained. The efficiency of this approach to the design of inorganic compounds can be illustrated by comparing the predictions with newer experimental data obtained after the publication of our predictions [12].

Table 1 contains predictions of the possibility of AB3X3 compound formation in the A2X3–B2X systems (A and B are various elements, and X = S, Se, or Te) under normal conditions; such compounds could be promising in the search for new semiconductor, nonlinear optical, electro-optical, and acousto-optical materials. Experimental information on 117 examples in which AB3X3 compounds are formed and 58 examples in which no compounds of this composition are formed in the A2X3–B2X systems under normal conditions was used for computer analysis. To describe the compounds in computer memory we selected properties of the A, B, and X elements (the melting and boiling points; covalent, ionic (by Bokii and Belov), and pseudopotential (by Zunger) radii; the first three ionization potentials; electronegativity (by Pauling); the standard enthalpies of atomization and evaporation; thermal conductivity; molar heat capacities, etc.), properties of the simple chalcogenides A2X3 and B2X (standard entropy and enthalpy), and some algebraic functions of these properties (for example, the ratio of the covalent radius to the metal radius for elements A, B, and X). Table 1 presents examples of predictions for the AB3X3 compounds and the results of their experimental verification. The following notation is used: 1, prediction of AB3X3 formation under normal conditions; 2, prediction of the absence of AB3X3 under normal conditions; #, examples on which information was used for machine learning; empty cells, uncertain prediction; ©, the prediction of AB3X3 formation matches new experimental data; and Θ, the prediction of compound absence matches experimental data. All 27 tested predictions coincided with the experimental data.

Table 1. AB3X3 compounds formation possibility prediction
A Fe Ga In Sn Sb La Ce Pr Nd Sm Eu Gd Tb Dy Bi
B
X = S
K © #2 1 1 #1 1 1 1 1 1 1 1 1 #2
Rb © #1 1 © 1 1 1 1 1 1 1 #1
Tl 1 Θ #1 #1 #1 #2 2 #2 Θ 2 2
X = Se
K #1 #1 1 © #1 1 1 1 1 1 1 #1
Rb 1 1 1 1 © 1 1 1 1 1 1 1 #1
Ag 2 #2 Θ #2 Θ Θ Θ 2 Θ Θ Θ 2 Θ
Cs 1 #1 1 1 © 1 1 1 1 1 1 #1
Tl 1 Θ #2 #1 #1 2 2 2 2 #2
X = Te
Rb 1 1 1 © 1 1 1 1 1 1 1 1 1
Ag 2 #2 #2 Θ #2 2 2 2 2 2 2 #2 Θ Θ Θ
Cs 1 1 1 © 1 1 1 1 1 1 1 1 1
Tl © Θ Θ #1 #2 #2 2 2 Θ #2

Conclusions

Over half a century, predictions of thousands of inorganic compounds in binary, ternary, and more complicated chemical systems were obtained, and some of their properties (melting point, critical temperature of superconductivity, band gap energy, etc.) were estimated at IMET [1, 2, 5-7, 12, 16, 17, 23, 24]. The use of these predictions enables substantial progress in the search for new magnetic, semiconductor, superconductor, nonlinear optical, electro-optical, acousto-optical, and other materials. Hundreds of the predicted compounds have been synthesized, and experimental verification of our results shows that the average prediction accuracy is higher than 80% [2, 5-7]. The application of machine learning methods to the search for regularities in big chemical data gives an opportunity for the theoretical design of new inorganic compounds, which makes it possible to substantially reduce the costs of the search for new
materials with predefined properties by replacing experiments with computations. It is important to note that only information on the properties of components (chemical elements or simpler compounds) is used in the prediction process.

This work was partially supported by the Russian Foundation for Basic Research (project nos. 16-07-01028, 17-07-01362, and 15-07-00980). We are grateful to V.V. Ryazanov, O.V. Sen’ko and A.A. Dokukin for long-term help and collaboration.

References

[1] E. M. Savitskii, Yu. V. Devingtal’, and V. B. Gribulya. Prediction of metallic compounds with composition A3B using computer. Dokl. Akad. Nauk SSSR (English translation - Doklady Physical Chemistry), 183(5), p. 1110-1112, 1968.

[2] E. M. Savitskii and V. B. Gribulya. Application of computer techniques in the prediction of inorganic compounds. New Delhi-Calcutta: Oxonian Press Pvt., Ltd. 1985.

[3] Yu. V. Devingtal’. About optimal coding of objects at their classification using pattern recognition methods. Izvestiya Akademii Nauk SSSR. Tekhnicheskaya Kibernetika, 1, p. 162-169, 1968.

[4] Yu. V. Devingtal’. Coding of objects at application of separating hyper-plane for their classification. Izvestiya Akademii Nauk SSSR. Tekhnicheskaya Kibernetika, 3, p. 139-147, 1971.

[5] N.N. Kiselyova. Komp’yuternoe konstruirovanie neorganicheskikh soedinenii. Ispol’zovanie baz dannykh i metodov iskusstvennogo intellekta (Computer Design of Inorganic Compounds: Use of Databases and Artificial Intelligence Methods). Moscow: Nauka, 2005.

[6] G.S. Burkhanov and N.N. Kiselyova. Prediction of intermetallic compounds, Russ. Chem. Rev., 78(6), p. 569-587, 2009.

[7] N. Kiselyova, A. Stolyarenko, V. Ryazanov, et al. Application of Machine Training Methods to Design of New Inorganic Compounds. In Diagnostic Test Approaches to Machine Learning and Commonsense Reasoning Systems. Ed. by X.A. Naidenova & D.I. Ignatov. Hershey: IGI Global, p. 197-220, 2012.

[8] Site of Materials Genome Initiative: https://www.mgi.gov/ .

[9] Site of Center for Materials Research by Information Integration: http://www.nims.go.jp/eng/research/MII-I/index.html .

[10] X.-G. Lu. Remarks on the recent progress of Materials Genome Initiative, Sci. Bull., 60(22), p. 1966-1968, 2015.

[11] P. Villars, K. Brandenburg, M. Berndt, et al. Binary, ternary and quaternary compound former/nonformer prediction via Mendeleev number, J. Alloys and Compounds, 317-318, p. 26-38, 2001.

[12] N.N. Kiselyova. Prediction of Formation of AB3X3 (X = S, Se, Te), Inorg. Mater., 45(10), p. 1077-1080, 2009.

[13] A.O. Oliynyk, E. Antono, T.D. Sparks et al. High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds, Chem. Mater., 28(20), p. 7324-7331, 2016.

[14] G. Pilania, P.V. Balachandran, J.E. Gubernatis, and T. Lookman. Classification of ABO3 perovskite solids: a machine learning study, Acta Crystallogr., B71(5), p. 507-513, 2015.

[15] A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka. Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids, Phys. Rev., B89(5), p. 054303/1-9, 2014.

[16] E.M. Savitskii, V.B. Gribulya, and N.N. Kiselyova. Cybernetic prediction of superconducting compounds, CALPHAD, 3(3), p. 171-173, 1979.

[17] N.N. Kiselyova, V.A. Dudarev, M.A. Korzhuyev. Database on the Bandgap of Inorganic Substances and Materials, Inorganic Materials: Applied Research, 7(1), p. 34-39, 2016.

[18] S.P. Sun, D.Q. Yi, Y. Jiang, et al. Prediction of formation enthalpies for Al2X-type intermetallics using back-propagation neural network, Mater. Chem. and Phys., 126(3), p. 632-641, 2011.

[19] A. Bahrami, A. S. H. Mousavi, and A. Ekrami. Prediction of mechanical properties of DP steels using neural network model, J. Alloys and Compounds, 392(1-2), p. 177-182, 2005.

[20] M.S. Gaafar, M.A.M. Abdeen, and S.Y. Marzouk. Structural investigation and simulation of acoustic properties of some tellurite glasses using artificial intelligence technique, J. Alloys and Compounds, 509, p. 3566-3575, 2011.

[21] M. Hayajneh, A.M. Hassan, A. Alrashdan, and A.T. Mayyas. Prediction of tribological behavior of aluminum–copper based composite using artificial neural network, J. Alloys and Compounds, 470, p. 584-588, 2009.

[22] D.J. Scott, P.V. Coveney, J.A. Kilner, et al. Prediction of the functional properties of ceramic materials from composition using artificial neural networks, J. Eur. Ceram. Soc., 27(16), p. 4425-4435, 2007.

[23] N.N. Kiselyova, A.V. Stolyarenko, V.V. Ryazanov, et al. A system for computer-assisted design of inorganic compounds based on computer training, Pattern Recognition and Image Analysis, 21(1), p. 88-94, 2011.

[24] N.N. Kiselyova, V.A. Dudarev, and V.S. Zemskov. Computer information resources in inorganic chemistry and materials science, Russ. Chem. Rev., 79(2), p. 145-166, 2010.

[25] Site of IMET RAS DBs: http://imet-db.ru .

[26] Yu. I. Zhuravlev, V. V. Ryazanov, and O. V. Sen’ko. RECOGNITION. Mathematical methods. Software system. Practical solutions. Moscow: Phasis. 2006.

[27] V. P. Gladun. Processes of formation of new knowledge. Sofia: SD “Pedagog 6”. 1995.