=Paper=
{{Paper
|id=Vol-2022/paper57
|storemode=property
|title=
Machine Learning Methods Application to Search for Regularities in Chemical Data
|pdfUrl=https://ceur-ws.org/Vol-2022/paper57.pdf
|volume=Vol-2022
|authors=Nadezhda N. Kiselyova,Andrey V. Stolyarenko,Victor A. Dudarev
|dblpUrl=https://dblp.org/rec/conf/rcdl/KiselyovaSD17
}}
==
Machine Learning Methods Application to Search for Regularities in Chemical Data
==
© N.N. Kiselyova (1), © A.V. Stolyarenko (1), © V.A. Dudarev (1, 2)

(1) Institution of Russian Academy of Sciences A.A. Baikov Institute of Metallurgy and Materials Science RAS (IMET RAS), Moscow
(2) National Research University Higher School of Economics (NRU HSE), Moscow, Russia

kis@imet.ac.ru, stol-drew@yandex.ru, vic@imet.ac.ru

Abstract. We discuss the possibility of searching for classification regularities in large arrays of chemical information by means of machine learning methods. The peculiarities of tasks in inorganic chemistry and materials science are considered, and a short review of applications of these methods to inorganic chemistry and materials science is presented. A system for computer-assisted design of inorganic compounds based on machine learning methods has been developed. The system makes it possible to predict new inorganic compounds and to estimate some of their properties without experimental synthesis. The results of applying this information-analytical system to the design of inorganic compounds are promising for the search for new materials.

Keywords: machine learning, database, inorganic chemistry, design of inorganic compounds.

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10-13, 2017

1 Introduction

Over the centuries of their evolution, chemistry and materials science have accumulated a huge amount of information. In common with other experimental sciences, chemistry has passed through several stages: information accumulation, data analysis, and the development of classification schemes and rules that allow a new object to be assigned to a particular class of substances. The division of substances into inorganic and organic ones, the Periodic Table of elements, the classification of compounds by crystal structure type, etc. are examples of such classifications. Essentially, in all cases these classifications are imprecise, and the classes partially intersect. For example, organic chemistry is defined as the chemistry of carbon compounds, yet carbides and carbonates are treated as objects of inorganic chemistry, as are boron hydrides (boranes) and silicon hydrides (silanes), which are closer to hydrocarbons (objects of organic chemistry) in many properties. To a large extent this is caused by imperfection of the classification rules developed by chemists. One way to get around these problems in inorganic chemistry and materials science is to apply machine learning methods to information analysis in order to discover complicated classifying regularities that allow substances to be assigned to particular classes. It is noteworthy that the regularities obtained include the properties of substance components as variables; for this reason, they allow us to predict the class of a substance that has not yet been synthesized, knowing only the well-known parameter values of the chemical elements that form it.

Half a century ago IMET pioneered the application of this machine learning approach to the search for classifying regularities, which allows the prediction of new inorganic compounds and the estimation of some of their properties [1, 2]. The application of machine learning methods [3, 4] allowed new binary compounds to be predicted with 90% reliability from the properties of the constituent chemical elements alone. The success of the approach put forward at IMET gave an impetus to many investigations of machine learning applications to inorganic chemistry and materials science carried out in various countries. The geography of investigations in this field is very wide: Europe, America, Asia, and Africa (figure 1). The most representative teams work in Russia, the USA, and China. More detailed reviews of this research are given in the monograph [5] and the reviews [6, 7]. It should be noted that in recent years governmental initiatives aimed at applying IT (including machine learning methods) to chemistry and materials science have been announced in the developed countries: the Materials Genome Initiative (the USA) [8], the Materials Research by Information Integration Initiative (Japan) [9], and the Chinese Materials Genome (China) [10]. It is expected that the use of theoretical methods will bring substantial progress in chemistry and materials science and reduce costs during the research, development, and production of new materials.

2 Problem Statement and Decision Methods

Suppose that every inorganic substance is described by a vector $x = (x_1^{(1)}, x_2^{(1)}, \ldots, x_M^{(1)}, x_1^{(2)}, x_2^{(2)}, \ldots, x_M^{(2)}, \ldots, x_1^{(L)}, x_2^{(L)}, \ldots, x_M^{(L)})$, where L is the number of chemical elements that form the compound and M is the number of parameters of each chemical element. Each substance is also characterized by a class membership parameter $a(x) \in \{1, 2, \ldots, K\}$, where K is the number of classes.
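
The composite attribute description defined above can be illustrated with a minimal sketch. It assumes M = 2 parameters per element (atomic number and Pauling electronegativity); the names `ELEMENT_PARAMS` and `compound_vector` are hypothetical and are not part of the system described in this paper, and the listed values are only illustrative.

```python
from typing import Dict, List, Sequence

# Illustrative element parameters (M = 2): atomic number and Pauling
# electronegativity. Values are for demonstration; verify against a reference table.
ELEMENT_PARAMS: Dict[str, List[float]] = {
    "K": [19.0, 0.82],
    "Sb": [51.0, 2.05],
    "S": [16.0, 2.58],
}


def compound_vector(elements: Sequence[str],
                    params: Dict[str, List[float]] = ELEMENT_PARAMS) -> List[float]:
    """Concatenate the M parameters of each of the L elements, in the given order."""
    x: List[float] = []
    for element in elements:
        x.extend(params[element])
    return x


# L = 3 elements, so the description has L * M = 6 components.
x = compound_vector(["Sb", "K", "S"])
print(x)
```

The same block of M parameters is repeated once per element, so the position of a value in the vector encodes both the parameter and the component it belongs to.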
The learning sample consists of N objects: $S = \{x_i, i = 1, \ldots, N\}$. We denote the subset of learning sample objects from class $a_j$, $j = 1, 2, \ldots, K$, by $S_{a_j} = \{x : a(x) = a_j\}$. The aim of machine learning is to construct a classification rule that not only distinguishes objects of different classes in the learning sample but also retains predictive ability for new combinations of chemical elements that were not used for learning.

Figure 1. Distribution of publications related to applications of machine learning methods to inorganic chemistry and materials science by country.

Among the numerous machine learning methods, various modifications of Artificial Neural Network (ANN) learning algorithms and Support Vector Machine (SVM) algorithms are the most popular (figure 2). This is due to the accessibility of appropriate software packages and to the apparently high accuracy on examination (test) examples (many investigators do not take into account the influence of the overfitting effect, which is inherent in these methods, on the reliability of subsequent predictions).

Figure 2. Popularity of various machine learning methods in inorganic chemistry.
Notation: ANN – artificial neural network learning; SVM – support vector machine; KNN – k-nearest neighbors method; DT – decision trees learning; GPN – concept formation using growing pyramidal networks; LM – linear machine method; LDF – linear Fisher discriminant; LoReg – voting algorithm where estimations for classes are calculated by means of voting by a system of logical regularities; SWS – statistical weighted syndromes; DTA – deadlock test algorithm; ECA – estimate calculating algorithm.

A great diversity of chemical and materials science tasks has been solved successfully using machine learning methods, e.g.:

theoretical tasks of prediction of:
- the phase diagram type of an inorganic system [5, 11];
- the formation of inorganic compounds with a certain stoichiometric composition [1, 2, 5-7, 12];
- the crystal structure type of inorganic compounds [5-7, 13, 14];
- some properties of inorganic compounds (melting point [15], critical temperature of superconductivity [16], band gap energy [17], enthalpy of formation [18], etc.);

technological tasks of prediction of:
- mechanical properties of steels [19];
- acoustic properties of tellurite glasses [20];
- tribological behavior of aluminum–copper based composite [21];
- functional properties of ceramic materials [22], and so on.

3 Experience in machine learning system development for chemical applications

A special information-analytical system (IAS) that automates the procedure of solving tasks in inorganic chemistry using machine learning was developed at IMET [23]. The peculiarities of the subject field were taken into account when creating the IAS, namely:

1) Composite structure of the attribute description: the set of parameters of chemical elements (the components of an inorganic substance) is repeated as many times as there are elements in the compound.
2) Strong correlation within the set of these attributes for each component, due to their dependence on a common parameter, the atomic number of the chemical element (this follows from the Periodic Law).
3) Properties of individual chemical elements give a small information gain; therefore, more informative parameters of simple compounds (for example, simple oxides, halides, chalcogenides, etc.) and algebraic functions of component properties are widely used in addition.
4) Blanks in attribute values, which are filled by various methods, including interpolation that takes into account the periodicity of chemical element properties with atomic number.
5) Large asymmetry of learning set sizes for different classes (and often the least represented classes, as a rule newly obtained classes of substances, are the most interesting for chemists).
6) Errors and discrepancies in the experimental classification of inorganic compounds in the learning set, which drastically decrease the prediction accuracy.

The machine learning procedure involves several stages:
1) objects selection for machine learning,
2) attribute description formation (including selection of the most informative attributes and filling of blanks in attribute values),
3) machine learning algorithms selection,
4) machine learning, including the application of algorithm ensembles and the synthesis of a collective solution when several algorithms are used,
5) machine learning quality estimation,
6) prediction of the status of new objects and interpretation of the results.

3.1 Objects selection for machine learning

The formation of a representative and reliable set for machine learning largely determines the accuracy of subsequent predictions. Objects for machine learning (examples of known inorganic substances) are selected by subject domain experts using information stored in databases (DBs) on the properties of inorganic substances and materials, including the DBs developed at IMET [17, 23-25]. The latter contain data on tens of thousands of substances and are accessible via the Internet [25]. The data on substances were extracted from thousands of publications. As in other intellectual fields, papers can contain errors and inaccuracies. Experimental errors in object classification contribute significantly to the decrease in prediction accuracy. However, estimating the classification reliability of tens of thousands of substances is an extremely expensive and practically impossible task. We propose partial automation of the search for data outliers using machine learning. This works best for detecting errors caused by incorrect or incomplete experimental knowledge of the class to which the substance belongs (for example, the crystal structure type), as well as by erroneous property values of the components that form the substance description. In the latter case the errors can result from incorrect experimental measurement of a property value, or they can be associated with incorrect interpolation when filling blanks in attribute values. Analysis of the machine learning results makes it possible to detect substances that fall within another class and to pass this information to the chemist for expert assessment of the substance and a decision on its status. The principal possibility of solving this problem stems from a specific feature of the subject domain: the periodic variation of the properties of inorganic compounds with the atomic numbers of the elements that are the components of the chemical system.

3.2 Attribute description formation

Attribute description formation is a complicated and hard-to-solve problem of modern machine learning theory. There are a large number of approaches that have proved effective for various types of tasks. However, it is impossible to develop a universal method of attribute selection that is guaranteed to be optimal. For this reason we use a few alternative methods of attribute selection with subsequent synthesis of a collective decision. In addition, 2D-projection visualization tools are applied to the points corresponding to compounds of a certain type in the space of chemical element properties. The set of parameters includes not only the initial attributes but also algebraic functions of these attributes selected by the user.

3.3 Machine learning algorithms selection

The IAS includes a set of machine learning algorithms that are the most popular among chemists (figure 2). At present the IAS comprises the following software: programs based on the well-known linear machine methods, the Fisher linear discriminant, k-nearest neighbors, support vector machine, and neural-network algorithms, as well as algorithms developed by the Computing Centre of the Russian Academy of Sciences and based on estimates calculation, deadlock test voting algorithms, logical regularities voting algorithms, weighted statistical voting algorithms, etc. [26]. The IAS also includes the ConFor system for machine learning by the concept formation procedure, developed by the Institute of Cybernetics, National Academy of Sciences of Ukraine [27]. This system is built upon the arrangement of data in computer memory in the form of growing pyramidal networks. For each task at hand, the most accurate machine learning algorithms are selected for subsequent use in the decision making and prediction procedures.

3.4 Machine learning

Our experience in solving prediction tasks in inorganic chemistry [6, 7, 12, 17, 23, 24] shows that the application of algorithm ensembles considerably increases the accuracy of inorganic compounds prediction. The decision making process uses the most accurate machine learning algorithms selected at the previous stage. The IAS includes programs implementing various collective decision strategies based on the Bayes method, clustering and selection methods, decision templates, logical correction, the convex stabilizer method, the Woods dynamic method, committee methods, etc. [26].

3.5 Machine learning quality estimation

Cross-validation on the learning set objects is the most widely used universal and reliable tool for machine learning quality estimation. The IAS contains special software implementing this procedure, which is used in selecting the best machine learning algorithms. However, when cross-validation accuracy is used as the criterion optimized in constructing algorithm ensembles, it is no longer an unbiased estimate of their accuracy. In this case there is a certain risk of overfitting. For this reason, the accuracy of collective algorithms is evaluated by the traditional approach: examination recognition of N examples chosen randomly from the learning sample and not used in learning (at the final prediction stage, these reference examples are returned to the learning set). The corresponding program is included in the IAS.

The asymmetry of learning set sizes for different classes is an important problem in machine learning accuracy estimation.
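
The examination procedure and the class asymmetry concern described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the program included in the IAS: the helper names `holdout_split` and `per_class_accuracy`, the class sizes (95 and 5), and the deliberately naive majority-class rule are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def holdout_split(n_objects: int, n_exam: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly set aside n_exam object indices for examination; the rest are used for learning."""
    rng = random.Random(seed)
    indices = list(range(n_objects))
    rng.shuffle(indices)
    return indices[n_exam:], indices[:n_exam]


def per_class_accuracy(y_true: Sequence[Hashable],
                       y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of correctly recognized examination objects, computed separately per true class."""
    hits: Dict[Hashable, int] = defaultdict(int)
    totals = Counter(y_true)
    for true_label, predicted_label in zip(y_true, y_pred):
        if true_label == predicted_label:
            hits[true_label] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Hypothetical, strongly asymmetric learning set: 95 objects of class 1, 5 of class 2.
labels = [1] * 95 + [2] * 5
learn_idx, exam_idx = holdout_split(len(labels), n_exam=20, seed=0)

# A deliberately naive rule: always answer with the majority class of the learning part.
majority = Counter(labels[i] for i in learn_idx).most_common(1)[0][0]
y_true = [labels[i] for i in exam_idx]
y_pred = [majority] * len(y_true)

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_accuracy(y_true, y_pred)
print(f"overall accuracy: {overall:.2f}")
print(f"per-class accuracy: {by_class}")
```

By construction, the overall accuracy of this rule cannot fall below 0.75, while its accuracy on class 2 is zero whenever class 2 objects appear in the examination set, so a single overall figure says little about the small class.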
When class sizes are this asymmetric, the overall examination recognition accuracy naturally does not represent the prediction error for small classes; therefore, ROC curves are appropriate for analyzing the prediction quality of different algorithms. ROC curves allow the recognition accuracy for the targeted and alternative classes to be compared as the cut-off that determines class membership is varied.

It should be pointed out that machine learning quality estimation remains a largely unsolved problem of machine learning. Some algorithms prone to overfitting (SVM, ANN, etc.) often show high examination recognition accuracy, but this does not always ensure high prediction reliability for new objects.

3.6 Prediction of the formation of new inorganic compounds and estimation of some of their properties

To increase prediction accuracy for learning sets with K classes (K > 2), the following method is used. First, multi-class learning and prediction are carried out. Next, K dichotomies are calculated, each separating a targeted class from all the alternative classes, followed by K predictions. The results of the multi-class prediction and of the series of dichotomies are compared, and if the predictions do not contradict each other, a decision on the object status is made. Special tools for collective decision formation based on this comparison were developed. The efficiency of this approach, which increases prediction accuracy, was confirmed in solving numerous tasks [5-7, 12, 17, 23, 24].

4 Illustration of IAS application to the search for regularities in chemical information

The application of machine learning made it possible to find regularities in the formation of inorganic compounds, to predict thousands of not yet synthesized substances, and to estimate some of their properties using the regularities obtained. The efficiency of this approach to the design of inorganic compounds can be illustrated by comparing the predictions with newer experimental data obtained after the publication of our predictions [12].

Table 1 contains predictions of the possibility of formation of AB3X3 compounds in the A2X3–B2X systems (A and B are various elements, and X = S, Se, or Te) under normal conditions. Such compounds could be promising in the search for new semiconductor, nonlinear optical, electro-optical, and acousto-optical materials. Experimental information on 117 examples of AB3X3 compounds formed and 58 examples in which no compounds of this composition were formed in the A2X3–B2X systems under normal conditions was used for computer analysis. To describe the compounds in computer memory we selected properties of the elements A, B, and X (the melting and boiling points; covalent, ionic (by Bokii and Belov), and pseudopotential (by Zunger) radii; the first three ionization potentials; electronegativity (by Pauling); the standard enthalpies of atomization and evaporation; thermal conductivity; molar heat capacities, etc.), properties of the simple chalcogenides A2X3 and B2X (standard entropy and enthalpy), and some algebraic functions of these properties (for example, the ratio of the covalent radius to the metal radius for elements A, B, and X). Table 1 presents examples of predictions for the AB3X3 compounds and the results of their experimental verification. The following notation is used: 1, prediction of AB3X3 formation under normal conditions; 2, prediction of AB3X3 absence under normal conditions; #, examples whose information was used for machine learning; empty cells, uncertain prediction; ©, the prediction of AB3X3 formation matches new experimental data; and Θ, the prediction of compound absence matches experimental data. All 27 tested predictions coincided with the experimental data.

Table 1. Prediction of the possibility of AB3X3 compound formation
   A Fe Ga In Sn Sb La Ce Pr Nd Sm Eu Gd Tb Dy Bi
   B
   X = S
   K © #2 1 1 #1 1         1 1 1 1 1 1 1 #2
   Rb ©     #1 1 © 1           1 1 1 1 1 1 #1
   Tl 1 Θ #1 #1 #1 #2 2 #2 Θ 2 2
   X = Se
   K #1 #1 1 © #1       1         1 1 1 1 1 #1
   Rb 1 1 1 1 © 1 1               1 1 1 1 1 #1
   Ag 2 #2 Θ #2 Θ Θ Θ 2 Θ Θ               Θ 2 Θ
   Cs 1 #1 1 1 © 1                1 1 1 1 1 #1
   Tl 1 Θ #2 #1 #1 2 2 2 2                      #2
   X = Te
   Rb 1 1 1 © 1 1 1               1 1 1 1 1 1
   Ag 2 #2 #2 Θ #2 2 2 2 2 2 2 #2 Θ Θ Θ
   Cs 1 1 1 © 1 1 1               1 1 1 1 1 1
   Tl © Θ Θ #1 #2 #2 2 2 Θ                      #2

Conclusions

Over half a century, predictions of thousands of inorganic compounds in binary, ternary, and more complicated chemical systems were obtained at IMET, and some of their properties (melting point, critical temperature of superconductivity, band gap energy, etc.) were estimated [1, 2, 5-7, 12, 16, 17, 23, 24]. The use of these predictions allows substantial progress in the search for new magnetic, semiconductor, superconductor, nonlinear optical, electro-optical, acousto-optical, and other materials. Hundreds of the predicted compounds have been synthesized, and experimental verification of our results shows that the average prediction accuracy is higher than 80% [2, 5-7]. The application of machine learning methods to the search for regularities in big chemical data gives an opportunity for the theoretical design of new inorganic compounds, which substantially reduces the costs of the search for new
materials with predefined properties by replacing experimental search with computations. It is important to note that only information on the properties of components (chemical elements or simpler compounds) is used in the prediction process.

This work was partially supported by the Russian Foundation for Basic Research (project nos. 16-07-01028, 17-07-01362, and 15-07-00980). We are grateful to V.V. Ryazanov, O.V. Sen’ko and A.A. Dokukin for long-term help and collaboration.

References
[1] E.M. Savitskii, Yu.V. Devingtal’, and V.B. Gribulya. Prediction of metallic compounds with composition A3B using computer. Dokl. Akad. Nauk SSSR (English translation - Doklady Physical Chemistry), 183(5), p. 1110-1112, 1968.
[2] E.M. Savitskii and V.B. Gribulya. Application of computer techniques in the prediction of inorganic compounds. New Delhi-Calcutta: Oxonian Press Pvt., Ltd. 1985.
[3] Yu.V. Devingtal’. About optimal coding of objects at their classification using pattern recognition methods. Izvestiya Akademii Nauk SSSR. Tekhnicheskaya Kibernetika, 1, p. 162-169, 1968.
[4] Yu.V. Devingtal’. Coding of objects at application of separating hyper-plane for their classification. Izvestiya Akademii Nauk SSSR. Tekhnicheskaya Kibernetika, 3, p. 139-147, 1971.
[5] N.N. Kiselyova. Komp’yuternoe konstruirovanie neorganicheskikh soedinenii. Ispol’zovanie baz dannykh i metodov iskusstvennogo intellekta (Computer Design of Inorganic Compounds: Use of Databases and Artificial Intelligence Methods). Moscow: Nauka, 2005.
[6] G.S. Burkhanov and N.N. Kiselyova. Prediction of intermetallic compounds, Russ. Chem. Rev., 78(6), p. 569-587, 2009.
[7] N. Kiselyova, A. Stolyarenko, V. Ryazanov, et al. Application of Machine Training Methods to Design of New Inorganic Compounds. In Diagnostic Test Approaches to Machine Learning and Commonsense Reasoning Systems. Ed. by X.A. Naidenova & D.I. Ignatov. Hershey: IGI Global, p. 197-220, 2012.
[8] Site of Materials Genome Initiative: https://www.mgi.gov/ .
[9] Site of Center for Materials Research by Information Integration: http://www.nims.go.jp/eng/research/MII-I/index.html .
[10] X.-G. Lu. Remarks on the recent progress of Materials Genome Initiative, Sci. Bull., 60(22), p. 1966-1968, 2015.
[11] P. Villars, K. Brandenburg, M. Berndt, et al. Binary, ternary and quaternary compound former/nonformer prediction via Mendeleev number, J. Alloys and Compounds, 317-318, p. 26-38, 2001.
[12] N.N. Kiselyova. Prediction of Formation of AB3X3 (X = S, Se, Te), Inorg. Mater., 45(10), p. 1077-1080, 2009.
[13] A.O. Oliynyk, E. Antono, T.D. Sparks et al. High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds, Chem. Mater., 28(20), p. 7324-7331, 2016.
[14] G. Pilania, P.V. Balachandran, J.E. Gubernatis, and T. Lookman. Classification of ABO3 perovskite solids: a machine learning study, Acta Crystallogr., B71(5), p. 507-513, 2015.
[15] A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka. Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids, Phys. Rev., B89(5), p. 054303/1-9, 2014.
[16] E.M. Savitskii, V.B. Gribulya, and N.N. Kiselyova. Cybernetic prediction of superconducting compounds, CALPHAD, 3(3), p. 171-173, 1979.
[17] N.N. Kiselyova, V.A. Dudarev, M.A. Korzhuyev. Database on the Bandgap of Inorganic Substances and Materials, Inorganic Materials: Applied Research, 7(1), p. 34-39, 2016.
[18] S.P. Sun, D.Q. Yi, Y. Jiang, et al. Prediction of formation enthalpies for Al2X-type intermetallics using back-propagation neural network, Mater. Chem. and Phys., 126(3), p. 632-641, 2011.
[19] A. Bahrami, A.S.H. Mousavi, and A. Ekrami. Prediction of mechanical properties of DP steels using neural network model, J. Alloys and Compounds, 392(1-2), p. 177-182, 2005.
[20] M.S. Gaafar, M.A.M. Abdeen, and S.Y. Marzouk. Structural investigation and simulation of acoustic properties of some tellurite glasses using artificial intelligence technique, J. Alloys and Compounds, 509, p. 3566-3575, 2011.
[21] M. Hayajneh, A.M. Hassan, A. Alrashdan, and A.T. Mayyas. Prediction of tribological behavior of aluminum–copper based composite using artificial neural network, J. Alloys and Compounds, 470, p. 584-588, 2009.
[22] D.J. Scott, P.V. Coveney, J.A. Kilner, et al. Prediction of the functional properties of ceramic materials from composition using artificial neural networks, J. Eur. Ceram. Soc., 27(16), p. 4425-4435, 2007.
[23] N.N. Kiselyova, A.V. Stolyarenko, V.V. Ryazanov, et al. A system for computer-assisted design of inorganic compounds based on computer training, Pattern Recognition and Image Analysis, 21(1), p. 88-94, 2011.
[24] N.N. Kiselyova, V.A. Dudarev, and V.S. Zemskov. Computer information resources in inorganic chemistry and materials science, Russ. Chem. Rev., 79(2), p. 145-166, 2010.
[25] Site of IMET RAS DBs: http://imet-db.ru .
[26] Yu.I. Zhuravlev, V.V. Ryazanov, and O.V. Sen’ko. RECOGNITION. Mathematical methods. Software system. Practical solutions. Moscow: Phasis. 2006.
[27] V.P. Gladun. Processes of formation of new knowledge. Sofia: SD "Pedagog 6”. 1995.