<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O.V. Sen'ko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>N.N. Kiselyova</string-name>
          <email>kis@imet.ac.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.A. Dudarev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.A. Dokukin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.V. Ryazanov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institution of Russian Academy of Sciences A.A. Baikov Institute of Metallurgy and Materials Science RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>152</fpage>
      <lpage>156</lpage>
      <abstract>
        <p>The accuracy of various machine learning methods (the «Recognition» package and the «Scikit-learn» package for Python) was compared on inorganic chemistry tasks. Cross-validation and ROC analysis were applied to accuracy estimation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Machine learning (ML) methods are widely used to predict the formation of inorganic compounds and to estimate their properties [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1-7</xref>]. The paper [<xref ref-type="bibr" rid="ref5">5</xref>] contains a statistical analysis of the popularity of various ML methods applied in inorganic materials science. However, despite the success of these methods on numerous tasks in this subject field, no attempt had been made to compare the accuracy of a wide variety of methods using ROC analysis.
      </p>
      <p>
        To solve this task, the particularities of the subject field must be taken into account. In particular, the attribute description has a composite structure: the set of chemical element parameters (the components of an inorganic substance) is repeated as many times as there are elements in the compound. Owing to the periodic dependence of chemical element properties on atomic number, strong correlations are observed within the parameter set of each component, and the relative informativeness of an individual element property is low. For this reason, the properties of simpler compounds (e.g., simple oxides, halogenides, chalcogenides, etc.) as well as algebraic functions of the components’ properties are used. Although these parameters are well studied, there are gaps in the property values (incomplete data). The gaps are filled in a variety of ways; for example, the periodic dependences of element parameters on atomic number are exploited with appropriate interpolation and extrapolation. A further peculiarity of inorganic chemistry tasks is the large asymmetry of training sample sizes for different classes. Very often the least represented classes (as a rule, newly discovered classes of substances) are the most interesting to chemists. Experimental errors and discrepancies in the classification of inorganic compounds in the training samples are yet another problem in compound design that drastically decreases prediction accuracy. Doubtless, the accuracy depends on the informativeness of the attribute description and on the representativeness of the training sample. Therefore, to evaluate various ML methods we chose a number of tasks with highly reliable predictions (more than 85 % according to later experimental verification) [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>].
      </p>
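      <p>
        The interpolation-based gap filling mentioned above can be sketched in Python. This is a minimal sketch: the element property values below are hypothetical, and linear interpolation over atomic number is only one of the gap-filling options named in the text.
      </p>
      <preformat>
```python
# Fill a missing element-property value by interpolating over atomic
# number. The ionic-radius values here are hypothetical, chosen only
# to illustrate the procedure.
import numpy as np

atomic_numbers = np.array([3, 11, 19, 37, 55])               # Li, Na, K, Rb, Cs
ionic_radius = np.array([0.76, 1.02, 1.38, np.nan, 1.67])    # gap for Rb

known = ~np.isnan(ionic_radius)
filled = ionic_radius.copy()
# np.interp(x, xp, fp): piecewise-linear interpolation at the missing points
filled[~known] = np.interp(atomic_numbers[~known],
                           atomic_numbers[known],
                           ionic_radius[known])
print(filled)
```
      </preformat>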
    </sec>
    <sec id="sec-2">
      <title>2 Prediction accuracy estimation methods</title>
      <p>Cross-validation (CV) on the training sample is the most widely used universal and reliable tool for estimating machine learning quality; the number of recognition errors can then be taken into account. However, one of the problems in ML accuracy estimation is determining recognition efficiency in the case of asymmetrical classes, where the numbers of objects in different classes differ significantly. This situation is common when only very few new materials with practically important properties have been obtained, and a search for not-yet-synthesized analogues of these substances can reduce the time and cost of experimental research. In most ML applications the standard decision rule minimizes the total number of erroneous predictions, which results in good recognition of compounds from the large class and bad recognition of substances from the small class. As a result, the overall recognition accuracy gives a poor notion of the efficiency of a given method or attribute description. An alternative approach is Receiver Operating Characteristic (ROC) analysis, which allows recognition accuracy to be compared for the targeted and alternative classes while varying the cut-off that identifies class membership.</p>
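      <p>
        The distinction between overall accuracy and ROC AUC on asymmetric classes can be illustrated with a short sketch. The 90/10 synthetic sample and the classifier below are illustrative assumptions, not the paper’s chemical data.
      </p>
      <preformat>
```python
# On an imbalanced sample, a trivial majority-class rule scores high
# accuracy but AUC = 0.5; AUC reflects actual ranking ability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Always predicting the majority class reaches ~90 % accuracy,
# yet gives AUC = 0.5 (no ability to rank the minority class).
triv_acc = accuracy_score(y_te, np.zeros_like(y_te))
triv_auc = roc_auc_score(y_te, np.zeros(len(y_te)))
print(acc, auc, triv_acc, triv_auc)
```
      </preformat>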
      <p>
        The following prediction accuracy estimation procedure was used in this analysis. The available training sample is divided into two non-intersecting stratified subsamples, which are then used to train and to assess single and collective methods independently. Then the ROC analysis is carried out and the Area Under Curve (AUC) measure is calculated. As a rule, in collective decision making only the methods whose AUC exceeds some fixed threshold are used.
      </p>
    </sec>
    <sec id="sec-tasks">
      <title>3 The test tasks</title>
      <p>
        3.1 Prediction of the formation of compounds with the composition A2BCHal6 (A and C are various monovalent metals; B are trivalent metals; Hal is F, Cl, or Br) [<xref ref-type="bibr" rid="ref7">7</xref>].
2 classes:
1. formation of the compound – 744 examples;
2. non-formation of the compound – 170 examples.
137 attributes, including the 3 most informative algebraic functions of the initial attributes.
      </p>
      <p>
        3.2 Prediction of the formation and the crystal structure type of compounds with the composition A2BCHal6 [<xref ref-type="bibr" rid="ref7">7</xref>].
4 classes:
1. elpasolites – 283 examples;
2. compounds with the Cs2NaCrF6 crystal structure type – 19 examples;
3. other crystal structure types – 57 examples;
4. non-formation of the compound – 83 examples.
134 attributes.
      </p>
      <p>
        3.3 Prediction of the formation of compounds with the composition ABHal3 (A are various monovalent metals; B are bivalent metals; Hal is F, Cl, Br, or I) [<xref ref-type="bibr" rid="ref6">6</xref>].
2 classes:
1. formation of the compound – 237 examples;
2. non-formation of the compound – 107 examples.
88 attributes.
      </p>
      <p>
        3.4 Prediction of the formation and the crystal structure type of compounds with the composition ABHal3 [<xref ref-type="bibr" rid="ref6">6</xref>].
6 classes:
1. perovskites – 46 examples;
2. compounds with the GdFeO3 crystal structure type – 20 examples;
3. compounds with the CsNiCl3 crystal structure type – 38 examples;
4. compounds with the NH4CdCl3 crystal structure type – 23 examples;
5. other crystal structure types – 39 examples;
6. non-formation of the compound – 111 examples.
88 attributes.
      </p>
      <p>
        The most important attribute sets were selected using a program based on the method of [<xref ref-type="bibr" rid="ref8">8</xref>].
      </p>
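      <p>
        The estimation procedure described in Section 2 (two non-intersecting stratified subsamples, then ROC analysis) can be sketched as follows. Synthetic two-class data and an SVM classifier stand in here for the actual chemical data sets; both are illustrative assumptions.
      </p>
      <preformat>
```python
# Divide the sample into two stratified halves, train on one half,
# compute ROC AUC on the other.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=900, weights=[0.8, 0.2], random_state=1)

# A stratified 50/50 split keeps the class proportions in both halves.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, stratify=y,
                                      random_state=1)

model = SVC(kernel="rbf", probability=True, random_state=1).fit(X_a, y_a)
auc = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(round(auc, 3))
```
      </preformat>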
    </sec>
    <sec id="sec-3">
      <title>4 The analysis of obtained results</title>
      <p>
        Table 1 contains the efficiency estimation results for single machine learning methods. The following algorithm notations were used (“Recognition” package [<xref ref-type="bibr" rid="ref9">9</xref>]):
• ECA – the estimates calculation algorithm (fixed size of support sets = 1), leave-one-out CV (LOOCV);
• SBT – the search for the best test (maximal number of ε-thresholds for one attribute = 5; maximal size of sample = 20; number of samples of the same size = 3; percent of tests used in recognition – 10 %; unitary weights), LOOCV;
• TLS – the two-dimensional linear separators method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of removed objects – 1; step – 100; threshold of regularity selection – 80 %), 10-fold CV;
• BDT – the binary decision tree learning (maximal number of interior nodes – 15; minimal significant value of entropy reduction – 0.2; minimal number of objects in leaf nodes – 5), LOOCV;
• LDF – the linear Fisher discriminant (confidence threshold for correlation coefficient – 0), LOOCV;
• LM – the linear machine method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of excluded objects – 1; step – 100), LOOCV;
• LoReg – the voting algorithm in which class estimates are calculated by voting over a system of logical regularities (“greedy” way; number of intervals – 5; maximal number of iterations – 100000; beginning of removal – 100; percentage of removed inequalities – 1 %; removal step – 100; minimal rate of objects – 0.1; number of random permutations – 3), 10-fold CV;
• MNN – the multiplicative neural network algorithm (number of iterations – 1000), LOOCV;
• MP – the multilayer perceptron (network configuration: 3 hidden layers with 10 neurons per layer; number of training iterations – 3000; activation function – sigmoid; training speed – 0.1; moment of inertia – 0; if the criterion function does not increase during the last 1000 iterations, the speed is halved), 10-fold CV;
• ANN – the artificial neural network trained with back-propagation (network configuration: 3 hidden layers with 10 neurons per layer; number of training iterations – 500; activation function – sigmoid; training speed – 0.1; threshold – 0.1; if the criterion function does not increase during the last 100 iterations, the speed is halved), 10-fold CV;
• KNN – the k-nearest neighbors method (number of nearest neighbors – 1; prior class probabilities are taken into account), LOOCV;
• SVM – the support vector machine (penalty coefficient – 5; kernel function type – Gaussian; kernel function parameter – 6; maximal number of iterations – 500), 10-fold CV;
• SWS – the statistical weighted syndromes (rapid mode; number of partition borders – 1; optimized criteria threshold – 4.5; representativeness threshold – 0.5; instability threshold – 0.2; denial zone – 0.1), 10-fold CV;
• DTA – the deadlock test algorithm (test searching algorithm – effective; divisor of ε-thresholds = 2; maximal size of sample = 20; number of subsamples of the same size = 3), LOOCV.
      </p>
      <p>
        “Scikit-learn” package for Python [<xref ref-type="bibr" rid="ref10">10</xref>], 10-fold CV:
• LIR – linear_model.LinearRegression;
• R – linear_model.Ridge;
• L – linear_model.Lasso;
• EN – linear_model.ElasticNet;
• LL – linear_model.LassoLars;
• OMP – linear_model.OrthogonalMatchingPursuit;
• BR – linear_model.BayesianRidge;
• HR – linear_model.HuberRegressor;
• KR – KernelRidge;
• PLS – PLSRegression;
• SGDC – linear_model.SGDClassifier;
• P – linear_model.Perceptron;
• PACH – the passive aggressive classifier (loss='hinge');
• PACS – the passive aggressive classifier (loss='squared_hinge');
• LSVC – linear SVC;
• NSVC1 – NuSVC (nu=0.1);
• NSVC3 – NuSVC (nu=0.3);
• LR – linear_model.LogisticRegression;
• GPC – the Gaussian process classifier;
• GNB – the Gaussian naive Bayes;
• DTC – tree.DecisionTreeClassifier;
• KNN – KNeighborsClassifier (n_neighbors=5);
• MP – neural_network.MLPClassifier;
• BC – ensemble.BaggingClassifier;
• RFC – ensemble.RandomForestClassifier;
• ETC – ensemble.ExtraTreesClassifier;
• ABC – ensemble.AdaBoostClassifier;
• GBC – ensemble.GradientBoostingClassifier.
      </p>
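      <p>
        The evaluation protocol for the Scikit-learn estimators (10-fold cross-validated ROC AUC) can be sketched for a few of the listed methods. Synthetic data replace the chemical data sets, and the subset of estimators is chosen only for brevity.
      </p>
      <preformat>
```python
# 10-fold cross-validated ROC AUC for several of the listed estimators,
# ranked from best to worst.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
estimators = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RFC": RandomForestClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
}
scores = {name: cross_val_score(est, X, y, cv=10, scoring="roc_auc").mean()
          for name, est in estimators.items()}
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(auc, 3))
```
      </preformat>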
      <sec id="sec-3-1">
        <title>Table 1. AUC values of the single algorithms (“Recognition” package)</title>
        <p>[Table 1: AUC values of the single algorithms (ANN, KNN, LM, MNN, TLS, LDF, MP, SBT, ECA, BDT) on the test tasks; the column layout of the table could not be recovered from the extracted text.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Table 1 (continued). AUC values of the single algorithms (“Scikit-learn” package)</title>
        <p>[AUC values of HR, PACH, PACS, SGDC, PLS, P, DTC, GNB, EN, LL, L and the remaining Scikit-learn estimators; the column layout of the table could not be recovered from the extracted text.]</p>
        <p>
          Table 2 includes the efficiency estimation results for the ensemble methods. The following algorithm notations were used (“Recognition” package [<xref ref-type="bibr" rid="ref9">9</xref>]):
• AC – the algebraic corrector (quadratic merit functional; minimal mean deviation = 0);
• CS – the convex stabilizer (function type – Gaussian);
• WD – the Woods dynamic method (number of objects in vicinity = 10);
• CCA – the complex committee method with averaging;
• CCM – the complex committee method with majority voting;
• BM – the Bayes method;
• CAS – the clustering and selection method (number of clusters = 3);
• LC – the logic corrector;
• GPC – the generalized polynomial corrector;
• VC – the voting method.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Table 2. CV accuracy and AUC of the ensemble methods</title>
        <p>[Table 2: CV accuracy (%) and AUC of the ensemble methods; only one row survived extraction: VCH – 86.6 % – 0.836.]</p>
        <p>The collective decision-making methods use the algorithms whose AUC values are marked in boldface in Table 1. The «default option» mode was used for choosing the algorithm parameter values.</p>
        <p>It should be noted that in most cases the choices of the most accurate single ML methods by cross-validation and by ROC analysis coincide. The best algorithms according to AUC values (see Table 1) are the support vector machine (SVM), the deadlock test algorithm (DTA), the artificial neural network trained with back-propagation (ANN), as well as the linear machine (LM), the statistical weighted syndromes (SWS), and the two-dimensional linear separators (TLS). Gradient boosting (GBC) tops the list for the Scikit-learn package. The worst algorithms are the binary decision tree learning (BDT), the search for the best test (SBT), the linear Fisher discriminant (LDF), the Elastic Net (EN), and the Lasso (L and LL).</p>
        <p>The most efficient algorithm ensembles (see Table 2) are the complex committee method with averaging (CCA), the logic corrector (LC), the generalized polynomial corrector (GPC), and the voting (VC). In most cases applying algorithm ensembles increases the prediction accuracy.</p>
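        <p>
          The ensemble construction discussed above, keeping only the single classifiers whose AUC exceeds a fixed threshold and combining them by voting, can be sketched as follows. The threshold value, the candidate set, and the use of soft voting are illustrative assumptions, not the exact procedure of the «Recognition» package.
        </p>
        <preformat>
```python
# Select single classifiers with cross-validated AUC above a threshold,
# then combine the selected ones into a soft-voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=2)
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "RFC": RandomForestClassifier(random_state=2),
}
THRESHOLD = 0.7  # illustrative; the paper does not state its threshold
selected = [(name, est) for name, est in candidates.items()
            if cross_val_score(est, X, y, cv=10, scoring="roc_auc").mean()
            > THRESHOLD]

ensemble = VotingClassifier(estimators=selected, voting="soft")
ens_auc = cross_val_score(ensemble, X, y, cv=10, scoring="roc_auc").mean()
print([name for name, _ in selected], round(ens_auc, 3))
```
        </preformat>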
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Conclusions</title>
      <p>The selection of the most accurate algorithms is among the most important tasks of ML. To solve it, the peculiarities of the subject field must be taken into account. In this research the ML software from the «Recognition» and «Scikit-learn» packages was tested on inorganic compound prediction tasks. As a rule, the small training samples in these tasks do not allow selecting a representative subset of objects for examinational recognition; in this context, cross-validation on the training sample is the most acceptable procedure for estimating the accuracy of ML algorithms. A substantial difference between the numbers of objects in different classes is a peculiarity of inorganic chemistry tasks; therefore, ROC analysis is the most acceptable method for evaluating the accuracy of these algorithms.</p>
      <p>Acknowledgments. This work was partially
supported by the Russian Foundation for Basic Research
(project nos. 17-07-01362 and 18-07-00080) and State
assignments No. 007-00129-18-00 and 0063-2020-0003.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          .
          <article-title>Komp'yuternoe konstruirovanie neorganicheskikh soedinenii. Ispol'zovanie baz dannykh i metodov iskusstvennogo intellekta (Computer Design of Inorganic Compounds:</article-title>
          <source>Use of Databases and Artificial Intelligence Methods)</source>
          . Moscow: Nauka.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>Support vector machine in chemistry</source>
          . Singapore: World Scientific Publishing Co. Pte. Ltd.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          .
          <article-title>Computer design of materials with artificial intelligence methods</article-title>
          .
          <source>In Intermetallic Compounds. Principles and Practice</source>
          , Vol.
          <volume>3</volume>
          ,
          <string-name>
            <surname>Westbrook</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Fleischer</surname>
          </string-name>
          , R.L. eds., p.
          <fpage>811</fpage>
          -
          <lpage>839</lpage>
          , Chichester, UK: John Wiley&amp;Sons, Ltd.
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.G.</given-names>
            <surname>Kusne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramprasad</surname>
          </string-name>
          .
          <source>Machine Learning in Materials Science. Recent Progress and Emerging Applications. Reviews in Computational Chemistry</source>
          ,
          <volume>29</volume>
          , p.
          <fpage>186</fpage>
          -
          <lpage>273</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Stolyarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.A.</given-names>
            <surname>Dudarev</surname>
          </string-name>
          .
          <article-title>Machine Learning Methods Application to Search for Regularities in Chemical Data</article-title>
          . Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017), Moscow, Russia, October 9-13, 2017. CEUR Workshop Proceedings, v. 2022, p.
          <fpage>375</fpage>
          -
          <lpage>380</lpage>
          ,
          <year>2017</year>
          . http://ceur-ws.org/Vol-2022/paper57.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiseleva</surname>
          </string-name>
          .
          <article-title>Prediction of the new compounds in the systems of halogenides of the univalent and bivalent metals</article-title>
          .
          <source>Russian Journal of Inorganic Chemistry</source>
          ,
          <volume>59</volume>
          (
          <issue>5</issue>
          ), p.
          <fpage>496</fpage>
          -
          <lpage>502</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Stolyarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.V.</given-names>
            <surname>Ryazanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Sen'ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Dokukin</surname>
          </string-name>
          .
          <article-title>Prediction of New Halo-Elpasolites</article-title>
          .
          <source>Russian Journal of Inorganic Chemistry</source>
          .
          <volume>61</volume>
          (
          <issue>5</issue>
          ), p.
          <fpage>604</fpage>
          -
          <lpage>609</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Senko</surname>
          </string-name>
          .
          <article-title>An Optimal Ensemble of Predictors in Convex Correcting Procedures</article-title>
          .
          <source>Pattern Recognition and Image Analysis</source>
          .
          <volume>19</volume>
          (
          <issue>3</issue>
          ), p.
          <fpage>465</fpage>
          -
          <lpage>468</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yu. I.</given-names>
            <surname>Zhuravlev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Ryazanov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Sen'ko</surname>
          </string-name>
          .
          <source>RECOGNITION. Mathematical methods. Software system. Practical solutions</source>
          . Moscow: Phasis.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pedregosa</surname>
          </string-name>
          et al.
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>JMLR 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>